* [virtio-comment] [PATCH v3 0/8] Introduce device migration support commands
@ 2023-10-30 13:19 Parav Pandit
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

This series introduces administration commands for member device migration
over the PCI transport; it can be extended to other transports when
needed.

It takes inspiration from a similar idea presented at KVM Forum [1].

Use case requirements:
======================
1. A hypervisor system needs to provide a PCI VF as a passthrough
   device to the guest virtual machine and also support live
   migration of this virtual machine.
   For a passthrough device, typically only the PCI configuration
   space and MSI-X table are emulated by the hypervisor. No virtio
   native interface offered by the virtio member device is trapped
   and/or emulated. This includes utilizing the member device's
   native virtio common and device configuration regions, device
   specific cvq and data vqs without any VMEXIT from the guest
   virtual machine and without any device type specific code in the
   hypervisor; such functionality is already present natively in the
   owner and member devices as a unified interface for guest virtual
   machines, containers and possibly more use cases.
2. A virtual machine may have one or more such passthrough
   virtio devices.
3. A virtual machine may have other PCI passthrough devices
   which may also interact with the virtio devices.
4. The hypervisor runs a generic, device type agnostic driver with
   an extension to support device migration.
5. A PCI VF passthrough device needs to support transparent
   device reset and PCI FLR while device migration is ongoing.
6. The owner driver is not involved in mediating device operations
   for the passthrough device at the virtio interface level.
7. The mechanism is generic enough to apply to a large family of
   virtio devices and does not involve trapping any virtio
   device interfaces of the passthrough devices.

Overview:
=========
The above use case requirements are addressed by the PCI PF group owner
driver facilitating the member device migration functionality using
administration commands.

There are three major functionalities.

1. Suspend and resume the device operation
2. Read and write the device context containing all the information
   needed to migrate from a source member device to a destination
   member device
3. Track pages written by the device while device migration is
   ongoing

This series introduces four infrastructure pieces covering the PCI
transport, peer-to-peer PCI devices, page write tracking (aka dirty page
tracking) and a generic virtio device context.

1. Device mode get and set (active, stop, freeze)
2. Device context read and write
3. Device context definition and compatibility command
4. Write reporting to track page addresses

This series enables migration between virtio PCI SR-IOV member devices.
It can also be used to migrate to/from a software composed PCI device
if/when needed, which can parse and compose a software based PCI virtio
device.

This can also be useful for accessing member devices using some variant
of data path acceleration only, instead of full passthrough functionality.

In the future, a nested environment may be able to utilize the same
infrastructure with a VF capable of supporting nested VFs through an
SR-IOV capability.

Example flow:
=============
Source hypervisor:
1. Instructs the device to start tracking pages it is writing
2. Periodically queries the addresses of the written pages
3. Suspends the device operation
4. Reads the device context and transfers it to the destination
   hypervisor (a minimal sketch of this source-side flow follows below)

Destination hypervisor:
5. Writes the device context received from the source
6. Resumes the device with the newly written device context
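
For illustration, below is a minimal C sketch of the source-side flow,
assuming hypothetical admin_cmd()/admin_cmd_read() helpers issued through
the owner (PF) device and a hypothetical send_to_destination() transport;
the opcodes and mode values are the ones defined in patches 1 and 4, and
the write-reporting steps of patches 6-7 are only indicated as comments.

/* Sketch only: the admin_cmd*() helpers and send_to_destination() are
 * hypothetical stand-ins for issuing group administration commands through
 * the owner (PF) device; opcodes and mode values follow patches 1 and 4. */
#include <stdint.h>
#include <stddef.h>

#define VIRTIO_ADMIN_CMD_DEV_MODE_SET  0x8
#define VIRTIO_ADMIN_CMD_DEV_CTX_READ  0xa

enum { MODE_ACTIVE = 0x0, MODE_STOP = 0x1, MODE_FREEZE = 0x2 };

int admin_cmd(uint64_t member, uint16_t opcode, const void *in, size_t in_len);
size_t admin_cmd_read(uint64_t member, uint16_t opcode, void *out, size_t out_len);
void send_to_destination(const void *buf, size_t len);

static int migrate_source(uint64_t member_id)
{
        uint8_t mode, buf[4096];
        size_t len;

        /* Steps 1-2: start and periodically query page write reporting
         * (commands introduced in patches 6-7, omitted here). */

        /* Step 3: suspend the member device operation. */
        mode = MODE_STOP;
        admin_cmd(member_id, VIRTIO_ADMIN_CMD_DEV_MODE_SET, &mode, sizeof(mode));
        mode = MODE_FREEZE;
        admin_cmd(member_id, VIRTIO_ADMIN_CMD_DEV_MODE_SET, &mode, sizeof(mode));

        /* Step 4: read the device context and transfer it to the destination;
         * a zero-length read response ends the context stream. */
        do {
                len = admin_cmd_read(member_id, VIRTIO_ADMIN_CMD_DEV_CTX_READ,
                                     buf, sizeof(buf));
                send_to_destination(buf, len);
        } while (len != 0);
        return 0;
}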

Patch summary:
==============
patch-1: Adds theory of operation for device migration commands 
patch-2: Redefines reserved2 as command specific output
patch-3: Defines short device context for split virtqueues
patch-4: Adds device migration commands
patch-5: Adds requirements for device migration commands
patch-6: Adds theory of operation for write reporting commands
patch-7: Adds write reporting commands
patch-8: Adds requirements for write reporting commands

Please review.

Changelog:
==========
v2->v3:
- updated cover letter to reflect the use for passthrough and for
  only data path acceleration
- updated cover letter to utilize same infra for nested
- Addressed comments from Michael
- updated read context command to not depend on returned data
  length for closing the read context stream, instead depend on
  explicit read response with zero length
- fixed copy paste errors in write context command for fields
  description
- added device and driver normatives for {_START,_END}_MARKER fields
- wrote member VF device instead of VF device
v1->v2:
- Addressed comments from Michael and Jason
- replaced iova with page/physical address range in write recording commands
- several device specific requirements added to clarify, interaction of
  device reset, FLR, PCI PM and admin commands
- added device context fields query command to learn compatibility
- split device context field type range into generic and device specific
- added device context extension section to maintain backward and future
  compatibility
- several rewording in theory of operation
- added requirements to cover config space read/write interaction with
  device context commands
- added assumption that the pci config space and msix table are not present
  in the device context; they can be added when the hypervisor needs them
v0->v1:
- enrich device context to cover device configuration layout, feature bits
- fixed alignment of device context fields
- added missing Sign-off for the joint work done with Satananda
- added link to the github issue

[1] https://static.sched.com/hosted_files/kvmforum2022/3a/KVM22-Migratable-Vhost-vDPA.pdf

Parav Pandit (8):
  admin: Add theory of operation for device migration
  admin: Redefine reserved2 as command specific output
  device-context: Define the device context fields for device migration
  admin: Add device migration admin commands
  admin: Add requirements of device migration commands
  admin: Add theory of operation for write recording commands
  admin: Add write recording commands
  admin: Add requirements of write reporting commands

 admin-cmds-device-migration.tex | 672 ++++++++++++++++++++++++++++++++
 admin.tex                       |  40 +-
 content.tex                     |   1 +
 device-context.tex              | 241 ++++++++++++
 4 files changed, 947 insertions(+), 7 deletions(-)
 create mode 100644 admin-cmds-device-migration.tex
 create mode 100644 device-context.tex

-- 
2.34.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* [virtio-comment] [PATCH v3 1/8] admin: Add theory of operation for device migration
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

Passthrough PCI VF devices are ubiquitous for virtual machine usage
via generic kernel frameworks.

A passthrough PCI VF device is fully owned by the virtual machine's
device driver. This passthrough device controls its own device
reset flow, basic functionality such as PCI VF function level reset
(FLR) and the rest of the virtio device functionality such as the
control vq, config space access and data path descriptor handling.

VM live migration using a precopy method is also widely used.

To support live migration of a VM with such passthrough virtio member
devices, the owner PCI PF device administers the device migration flow.

This patch introduces the basic theory of operation describing the flow
and the supporting administration commands.
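
For reference, a minimal C sketch of the three migration modes introduced
below; the enum name is hypothetical, while the values and per-mode
semantics follow the theory-of-operation table added by this patch.

/* Sketch only: the enum name is hypothetical; the values and per-mode
 * semantics follow the device migration mode table added below. */
enum virtio_dev_migration_mode {
        /* Default mode after instantiation of the member device. */
        VIRTIO_DEV_MODE_ACTIVE = 0x0,
        /* Device sends no notifications and does not access driver memory;
         * driver notifications may still arrive, so the device context and
         * device configuration space may keep changing. */
        VIRTIO_DEV_MODE_STOP = 0x1,
        /* Device accepts no driver notifications and ignores configuration
         * space writes; the device context no longer changes. */
        VIRTIO_DEV_MODE_FREEZE = 0x2,
};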

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
changelog:
v0->v1:
- addressed comments from Jason
- simplified commit log to remove wording of flow
- added link to the device reset section
- addressed comments from Michael
---
 admin-cmds-device-migration.tex | 95 +++++++++++++++++++++++++++++++++
 admin.tex                       |  1 +
 2 files changed, 96 insertions(+)
 create mode 100644 admin-cmds-device-migration.tex

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
new file mode 100644
index 0000000..d172130
--- /dev/null
+++ b/admin-cmds-device-migration.tex
@@ -0,0 +1,95 @@
+\subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device / Device groups / Group
+administration commands / Device Migration}
+
+In some systems, there is a need to migrate a running virtual machine
+from one system to another. A running virtual machine has one or more
+passthrough virtio member devices attached to it. A passthrough device
+is entirely operated by the guest virtual machine. For example, with
+the SR-IOV group type, the group member VF device undergoes device reset
+\ref{sec:Basic Facilities of a Virtio Device / Device Reset}
+and may also undergo PCI function level reset (FLR). Such operations
+are under the control of the guest virtual machine, which must comply with
+the device reset requirements and the PCI standard; at the same time those
+operations must not obstruct the device migration. In such a scenario,
+a group owner device can provide the administration command interface
+to facilitate the device migration related operations.
+
+When a virtual machine migrates from one hypervisor to another hypervisor,
+these hypervisors are named the source and the destination hypervisor respectively.
+In such a scenario, the source hypervisor administers the
+member device to suspend the device and preserve the device context.
+Subsequently, the destination hypervisor administers the member device to
+set up the device context and resume the member device. The source hypervisor
+reads the member device context and the destination hypervisor writes the member
+device context. The method to transfer the member device context from the source
+to the destination hypervisor is outside the scope of this specification.
+
+The member device can be in one of three migration modes. The owner driver
+sets the member device to one of the following modes during the device migration flow.
+
+\begin{tabularx}{\textwidth}{ |l||l|X| }
+\hline
+Value & Name & Description \\
+\hline \hline
+0x0   & Active &
+  It is the default mode after instantiation of the member device. \\
+\hline
+0x1   & Stop &
+ In this mode, the member device does not send any notifications,
+ and it does not access any driver memory.
+ The member device may receive driver notifications in this mode;
+ the member device context and device configuration space may change. \\
+\hline
+0x2   & Freeze &
+ In this mode, the member device does not accept any driver notifications,
+ it ignores any device configuration space writes, and the device
+ context does not change. The
+ member device is not accessed in the system through the virtio interface. \\
+\hline
+\hline
+0x3 - 0xFF   & -    & reserved for future use \\
+\hline
+\end{tabularx}
+
+When the owner driver wants to stop the operation of the
+device, the owner driver sets the device mode to \field{Stop}. Once the
+device is in the \field{Stop} mode, the device neither initiates any notifications
+nor accesses any driver memory. Since the member driver may still be
+active and may send further driver notifications to the device, the device
+context may be updated. When the member driver has stopped accessing the
+device, the owner driver sets the device to \field{Freeze} mode indicating
+to the device that no more driver access occurs. In the \field{Freeze} mode,
+no more changes occur in the device context. At this point, the device ensures
+that there will not be any update to the device context.
+
+The member device has a device context which the owner driver can either
+read or write. The member device context consists of any device specific
+data which is needed by the device to resume its operation when the device mode
+is changed from \field{Stop} to \field{Active} or from \field{Freeze}
+to \field{Active}.
+
+Once the device context is read, it is cleared from the device. Typically, on
+the source hypervisor, the owner driver reads the device context once when
+the device is in \field{Active} or \field{Stop} mode and again once the member
+device is in \field{Freeze} mode.
+
+Typically, the device context is read and written one time on the source and
+the destination hypervisor respectively once the device is in \field{Freeze}
+mode. On the destination hypervisor, after writing the device context,
+when the device mode is set to \field{Active}, the device uses the most recently
+written device context and resumes the device operation.
+
+In an alternative flow, on the source hypervisor the owner driver may choose
+to read the device context a first time while the device is in \field{Active} mode
+and a second time once the device is in \field{Freeze} mode. Similarly, the
+destination hypervisor writes the device context a first time while the device
+is still running in \field{Active} mode on the source hypervisor and writes
+the device context a second time while the device is in \field{Freeze} mode.
+This flow may result in a very short setup time as the device context likely
+has minimal changes from the previously written device context. This flow may
+reduce the device migration time significantly and may have a near constant
+device activation time regardless of the number of virtqueues, resources and
+passthrough devices in use by the migrating virtual machine.
+
+The owner driver can discard any partially read or written device context when
+any of the device migration flows is to be aborted.
diff --git a/admin.tex b/admin.tex
index 0803c26..6eeef58 100644
--- a/admin.tex
+++ b/admin.tex
@@ -297,6 +297,7 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
 might differ between different group types.
 
 \input{admin-cmds-legacy-interface.tex}
+\input{admin-cmds-device-migration.tex}
 
 \devicenormative{\subsubsection}{Group administration commands}{Basic Facilities of a Virtio Device / Device groups / Group administration commands}
 
-- 
2.34.1



* [virtio-comment] [PATCH v3 2/8] admin: Redefine reserved2 as command specific output
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

Currently, when a command wants to return two distinct types of data in
the result, such as one consumed by the driver and another to be zero
copied to some user buffers, the driver needs to prepare an
extra descriptor for the driver consumed field. When such a field is
<= 4 bytes, the extra descriptor is an overhead.

struct virtio_admin_cmd already has 4 bytes reserved in the device
writable area. Utilize them as device writable command specific output.
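
As an illustration, a minimal C sketch (struct and helper names are
hypothetical) of how a driver could consume such a 4-byte output in place,
mirroring the device-writable layout changed below; a later patch in this
series uses this field to return the read context length.

/* Sketch only: struct and helper names are hypothetical; the field layout
 * mirrors the device-writable part of struct virtio_admin_cmd below. */
#include <stdint.h>
#include <string.h>

struct admin_cmd_dev_writable {
        uint16_t status;
        uint16_t status_qualifier;
        uint8_t  command_specific_output[4]; /* was reserved2[4] */
        /* command_specific_result[] follows */
};

/* Read a <= 4 byte command output directly, no extra result descriptor. */
static uint32_t admin_cmd_output_le32(const struct admin_cmd_dev_writable *w)
{
        uint32_t v;

        memcpy(&v, w->command_specific_output, sizeof(v));
        return v; /* little-endian per the spec; convert on big-endian hosts */
}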

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
 admin.tex | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/admin.tex b/admin.tex
index 6eeef58..c86813d 100644
--- a/admin.tex
+++ b/admin.tex
@@ -90,8 +90,7 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
         /* Device-writable part */
         le16 status;
         le16 status_qualifier;
-        /* unused, reserved for future extensions */
-        u8 reserved2[4];
+        u8 command_specific_output[4];
         u8 command_specific_result[];
 };
 \end{lstlisting}
@@ -192,11 +191,15 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
 \hline
 \end{tabularx}
 
-Each command uses a different \field{command_specific_data} and
-\field{command_specific_result} structures and the length of
+Each command uses a different \field{command_specific_data},
+\field{command_specific_output} and
+\field{command_specific_result} fields. The length of
 \field{command_specific_data} and \field{command_specific_result}
-depends on these structures and is described separately or is
-implicit in the structure description.
+depends on the command and is described separately or is
+implicit in the structure description. The \field{command_specific_output}
+describes any command specific output which is up to 4 bytes in size. The
+\field{command_specific_output} contains one or more command specific
+fields.
 
 Before sending any group administration commands to the device, the driver
 needs to communicate to the device which commands it is going to
-- 
2.34.1



* [virtio-comment] [PATCH v3 3/8] device-context: Define the device context fields for device migration
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

Define the device context and its fields for the purpose of device
migration. The device context is read and written by the owner driver
on the source and destination hypervisors respectively.

Device context fields are expected to grow rapidly after this initial
version to cover many details of the device.
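
For illustration, a minimal C sketch of how a driver might walk the TLV
stream defined below, assuming the whole device context has been read into
a contiguous, suitably aligned buffer; the handle_field() helper is
hypothetical.

/* Sketch only: walks struct virtio_dev_ctx_field_tlv entries from
 * VIRTIO_DEV_CTX_START_MARKER to VIRTIO_DEV_CTX_END_MARKER. Type values and
 * the TLV layout follow device-context.tex below; handle_field() is a
 * hypothetical consumer. Fields are little-endian on the wire. */
#include <stdint.h>
#include <stddef.h>

struct virtio_dev_ctx_field_tlv {
        uint16_t type;        /* le16 on the wire */
        uint8_t  reserved[6];
        uint64_t length;      /* le64 on the wire */
        uint8_t  value[];
};

#define VIRTIO_DEV_CTX_START_MARKER 0x0
#define VIRTIO_DEV_CTX_END_MARKER   0x1

void handle_field(uint16_t type, const uint8_t *value, uint64_t len);

static int walk_dev_ctx(const uint8_t *buf, size_t len)
{
        size_t off = 0;

        while (off + sizeof(struct virtio_dev_ctx_field_tlv) <= len) {
                const struct virtio_dev_ctx_field_tlv *f =
                        (const void *)(buf + off);

                if (f->type == VIRTIO_DEV_CTX_END_MARKER)
                        return 0;               /* context complete */
                if (f->type != VIRTIO_DEV_CTX_START_MARKER)
                        handle_field(f->type, f->value, f->length);
                off += sizeof(*f) + f->length;  /* next TLV entry */
        }
        return -1; /* truncated context: no end marker found */
}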

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Satananda Burla <sburla@marvell.com>
---
changelog:
v2->v3:
- drop fields_count to support dynamic size device context
- instead, define start and end marker context fields
- added split virtqueue's used ring fields as they are read only
  fields for the device, so the device may not be able to read them
  after device is moved to active mode
- added device context invalidate tlv to speed up context migration
  under device reset flow
- split context types in 3 categories, marker, device common and device
  specific
v1->v2:
- addressed comments from Michael
- dropped layout from the enums and definition
- defined more practical fields type range of 16-bit
- split the range to generic and device type range
- added assumptions and device context extension sections for future
  proofing
v0->v1:
- enrich device context to cover feature bits, device configuration
  fields
- corrected alignment of device context fields
---
 content.tex        |   1 +
 device-context.tex | 241 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 242 insertions(+)
 create mode 100644 device-context.tex

diff --git a/content.tex b/content.tex
index 0a62dce..2698931 100644
--- a/content.tex
+++ b/content.tex
@@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
 UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
 
 \input{admin.tex}
+\input{device-context.tex}
 
 \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
 
diff --git a/device-context.tex b/device-context.tex
new file mode 100644
index 0000000..06ed43d
--- /dev/null
+++ b/device-context.tex
@@ -0,0 +1,241 @@
+\section{Device Context}\label{sec:Basic Facilities of a Virtio Device / Device Context}
+
+The device context holds the information that an owner driver can use
+to set up a member device and resume its operation. The device context
+of a member device is read or written by the owner driver using
+administration commands.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_field_tlv {
+        le16 type;
+        u8 reserved[6];
+        le64 length;
+        u8 value[];
+};
+
+struct virtio_dev_ctx {
+        struct virtio_dev_ctx_field_tlv fields[];
+};
+
+\end{lstlisting}
+
+The \field{struct virtio_dev_ctx} is the device context of a member device.
+It consists of two or more \field{struct virtio_dev_ctx_field_tlv} fields.
+
+The \field{struct virtio_dev_ctx_field_tlv} consists of a \field{type} indicating
+what data is contained in the \field{value} of length \field{length}.
+The valid values for \field{type} can be found in the following table:
+
+\begin{table}
+\caption{\label{tab:Device Context Fields} Device Context Fields}
+\begin{tabularx}{\textwidth}{ |l||l|X| }
+\hline
+Type & Name & Description \\
+\hline \hline
+\hline
+0x0 & VIRTIO_DEV_CTX_START_MARKER & Indicates start of the device context \\
+\hline
+0x1 & VIRTIO_DEV_CTX_END_MARKER & Indicates end of the device context \\
+\hline
+0x2 - 0xFF & - & Reserved for future device context markers \\
+\hline
+\hline
+0x100 & VIRTIO_DEV_CTX_DISCARD & Indicates to discard device context \\
+\hline
+0x101 & VIRTIO_DEV_CTX_DEV_FEATURES & Provides device features \\
+\hline
+0x102 & VIRTIO_DEV_CTX_PCI_COMMON_CFG & Provides common configuration space of device for PCI transport \\
+\hline
+0x103 & VIRTIO_DEV_CTX_PCI_VQ_CFG & Provides Virtqueue configuration for PCI transport \\
+\hline
+0x104 & VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG & Provides Queue run time state \\
+\hline
+0x105 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of virtqueue descriptors owned by device  \\
+\hline
+0x106 - 0xFFF & - & Generic device agnostic range reserved for future \\
+\hline
+\hline
+0x1000 & VIRTIO_DEV_CTX_DEV_CFG & Provides device specific configuration \\
+\hline
+0x1001 - 0x1FFF & - & Device type specific range reserved for future \\
+\hline
+\hline
+0x2000 - 0xFFFF & - & Reserved for future \\
+\hline
+\end{tabularx}
+\end{table}
+
+\subsection{Device Context Fields}\label{sec:Basic Facilities of a Virtio Device / Device Context / Device Context Fields}
+
+Device Context Fields are of three types:
+
+\begin{description}
+\item[Context Marker] type indicates the fields to describe the device context itself.
+     They are in the range 0 to 0xFF.
+\item[Device Common] type indicates the fields which are common across all devices.
+     They are in the range 0x100 to 0xFFF.
+\item[Device Specific] type indicates the fields which are device type specific.
+     They are in the range 0x1000 to 0x1FFF.
+\end{description}
+
+\subsubsection{Device Context Start Marker}
+For the field VIRTIO_DEV_CTX_START_MARKER, \field{type} is set to 0x0.
+The \field{value} is empty. The \field{length} is set to 0x0.
+
+\field{Device Context Start Marker} indicates the start of the device context.
+All the other device context fields are located after this field.
+
+\subsubsection{Device Context End Marker}
+For the field VIRTIO_DEV_CTX_END_MARKER, \field{type} is set to 0x1.
+The \field{value} is empty. The \field{length} is set to 0x0.
+
+\field{Device Context End Marker} indicates the end of the device context.
+All the other device context fields are located before this field.
+
+\subsubsection{Device Context Discard}
+For the field VIRTIO_DEV_CTX_DISCARD, \field{type} is set to 0x100.
+The \field{value} is empty. The \field{length} is set to 0x0.
+
+\field{Device Context Discard} indicates that any previous device context
+fields in the \field{Device Common} and in the \field{Device Specific} range are invalid.
+
+\subsubsection{Device Features Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Device Features Context}
+
+For the field VIRTIO_DEV_CTX_DEV_FEATURES, \field{type} is set to 0x101.
+The \field{value} contains the device feature bits listed in
+\ref{sec:Basic Facilities of a Virtio Device / Feature Bits}, in the format of \field{struct virtio_dev_ctx_features}.
+The \field{length} is the length of the \field{value}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_features {
+        le64 feature_bits[];
+};
+\end{lstlisting}
+
+\subsubsection{PCI Common Configuration Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Common Configuration Context}
+
+For the field VIRTIO_DEV_CTX_PCI_COMMON_CFG, \field{type} is set to 0x102.
+The \field{value} is in the format of \field{struct virtio_pci_common_cfg}.
+The \field{length} is the length of \field{struct virtio_pci_common_cfg}.
+
+\subsubsection{PCI Virtqueue Configuration Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Virtqueue Configuration Context}
+
+For the field VIRTIO_DEV_CTX_PCI_VQ_CFG, \field{type} is set to 0x103.
+The \field{value} is in the format of \field{struct virtio_dev_ctx_pci_vq_cfg}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_pci_vq_cfg}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_pci_vq_cfg {
+        le16 vq_index;
+        le16 queue_size;
+        le16 queue_msix_vector;
+        le16 reserved;
+        le64 queue_desc;
+        le64 queue_driver;
+        le64 queue_device;
+};
+\end{lstlisting}
+
+One or multiple entries of PCI Virtqueue Configuration Context may exist; each such
+entry corresponds to a unique virtqueue identified by the \field{vq_index}.
+
+\subsubsection{Virtqueue Split Mode Runtime Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Virtqueue Split Mode Runtime Context}
+
+For the field VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG, \field{type} is set to 0x104.
+The \field{value} is in the format of \field{struct virtio_dev_ctx_vq_split_runtime}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_runtime}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_vq_split_runtime {
+        le16 vq_index;
+        le16 dev_avail_idx;
+        le16 used_flags;
+        le16 used_idx;
+        le16 used_avail_event;
+        u8 enabled;
+        u8 reserved[7];
+};
+\end{lstlisting}
+
+One or multiple entries of \field{struct virtio_dev_ctx_vq_split_runtime}
+may exist; each such entry corresponds to a virtqueue identified
+by the \field{vq_index}.
+
+The \field{dev_avail_idx} indicates the next available index of the virtqueue from which
+the device must start processing the available ring.
+
+The \field{used_flags} indicates the last value written by the device for the
+field \field{flags} in the used ring \field{struct virtq_used}.
+
+The \field{used_idx} indicates the last value written by the device for the field
+\field{idx} in the used ring \field{struct virtq_used}.
+
+The \field{used_avail_event} indicates the last value written by the device in the
+field \field{avail_event} in the used ring \field{struct virtq_used}.
+
+\subsubsection{Virtqueue Split Mode Device owned Descriptors Context}
+
+For the field VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC, \field{type} is set to 0x105.
+The \field{value} is in the format of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_vq_split_dev_descs {
+        le16 vq_index;
+        le16 desc_count;
+        le16 desc_idx[];
+};
+\end{lstlisting}
+
+The \field{desc_idx} array contains \field{desc_count} indices of the descriptors of a
+virtqueue identified by \field{vq_index} which are owned by the device.
+
+One or multiple entries of \field{struct virtio_dev_ctx_vq_split_dev_descs} may exist; each such
+entry corresponds to a virtqueue identified by the \field{vq_index}.
+
+\subsubsection{Device Configuration Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Device Configuration Context}
+
+For the field VIRTIO_DEV_CTX_DEV_CFG, \field{type} is set to 0x1000.
+The \field{value} is in the format of the device specific configuration listed
+in each device type's device configuration layout section.
+For example, for the File System Device, \field{value} is in the format of
+\field{struct virtio_fs_config}.
+The \field{length} is the length of the device configuration data in
+\field{value}.
+
+\subsubsection{Device Context Extensions}
+Various considerations are necessary when creating a new device context field or
+when extending an existing device context field structure.
+
+1. How to define a new device context field? \\
+If the new field is generic for all the device types or most of the device types,
+it should be added under the generic field range. If the new field is unique to
+a device type, it should be added under the device type specific range. \\
+
+2. When to define a new device context field? \\
+When no device context field exists for some specific data, one should
+define a new device context field. \\
+
+3. How to avoid duplication of device context field definition with device
+   specific structures which may be present as control vq data structures? \\
+Each device should reuse any existing field definition that may exist as part
+of the device control virtqueue or any other request structure. \\
+
+4. How to extend an existing device context field definition? \\
+When an element is missing in an already defined field, a new element must be added at
+the end of the device context field. A new element MUST NOT be added at the beginning or in
+the middle of the field structure. Any element which is already present MUST NOT
+be removed. \\
+
+\subsubsection{Assumptions}
+For the SR-IOV group type, some hypervisors do not permit the driver to access
+the PCI configuration space and MSI-X Table space directly. Such hypervisors handle the
+query and saving of these fields without the need for them to exist in the device context.
+Hence, this version of the specification does not have them in the device context. A future
+extension of the device context may further include them with a new field type for
+each of these fields.
-- 
2.34.1



* [virtio-comment] [PATCH v3 4/8] admin: Add device migration admin commands
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

A passthrough device is mapped to the guest VM. A passthrough device
accessed by the driver can undergo its own device reset and, for the PCI
transport, its own PCI FLR while the guest VM migration is ongoing.
The passthrough device may not have any direct channel through which
device migration related administrative tasks can be done, and even if
it does, such administrative tasks must not be interrupted by the
device reset or VF FLR flow initiated by the passthrough device.

Hence, the owner driver which administers the member devices
facilitates the device migration flow.

Add device migration administration commands that the owner driver can use
for the passthrough device.

The device context used by these commands is defined in detail in the
previous patch.
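
As an illustration, a minimal C sketch of the destination-side flow using
the commands added here; admin_cmd() and recv_from_source() are hypothetical
helpers, and the opcodes and mode encoding are the ones defined in this
patch and in patch 1.

/* Sketch only: admin_cmd() and recv_from_source() are hypothetical;
 * opcodes and the mode encoding follow this patch and patch 1. */
#include <stdint.h>
#include <stddef.h>

#define VIRTIO_ADMIN_CMD_DEV_MODE_SET   0x8
#define VIRTIO_ADMIN_CMD_DEV_CTX_WRITE  0xb

enum { MODE_ACTIVE = 0x0, MODE_STOP = 0x1, MODE_FREEZE = 0x2 };

int admin_cmd(uint64_t member, uint16_t opcode, const void *data, size_t len);
size_t recv_from_source(void *buf, size_t len);

static int migrate_destination(uint64_t member_id)
{
        uint8_t mode = MODE_FREEZE, buf[4096];
        size_t len;

        /* The context can only be written while the member device is frozen. */
        admin_cmd(member_id, VIRTIO_ADMIN_CMD_DEV_MODE_SET, &mode, sizeof(mode));

        /* Write the device context in one or more chunks as it arrives. */
        while ((len = recv_from_source(buf, sizeof(buf))) != 0)
                admin_cmd(member_id, VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, buf, len);

        /* Resume operation; the most recently written context takes effect. */
        mode = MODE_ACTIVE;
        return admin_cmd(member_id, VIRTIO_ADMIN_CMD_DEV_MODE_SET, &mode, sizeof(mode));
}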

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Satananda Burla <sburla@marvell.com>
---
changelog:
v2->v3:
- updated read context command to not depend on returned data
  length to close the read context stream
- fixed copy paste errors in write context command for fields
  description
v1->v2:
- addressed comments from Michael
- updated commit log to refer to device context in later patch
- moved admin command table opcode to this (right) patch
- added command to query supported fields of the device context
---
 admin-cmds-device-migration.tex | 232 +++++++++++++++++++++++++++++++-
 admin.tex                       |  16 ++-
 device-context.tex              |   6 +-
 3 files changed, 249 insertions(+), 5 deletions(-)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index d172130..c5030d2 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -66,7 +66,8 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 read or write. The member device context consists of any device specific
 data which is needed by the device to resume its operation when the device mode
 is changed from \field{Stop} to \field{Active} or from \field{Freeze}
-to \field{Active}.
+to \field{Active}. The device context is described in section
+\ref{sec:Basic Facilities of a Virtio Device / Device Context}.
 
 Once the device context is read, it is cleared from the device. Typically, on
 the source hypervisor, the owner driver reads the device context once when
@@ -93,3 +94,232 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 
 The owner driver can discard any partially read or written device context when
 any of the device migration flows is to be aborted.
+
+The owner driver uses the following device migration group administration commands.
+
+\begin{enumerate}
+\item Device Mode Get Command
+\item Device Mode Set Command
+\item Device Context Size Get Command
+\item Device Context Read Command
+\item Device Context Write Command
+\item Device Context Supported Fields Query Command
+\item Device Context Discard Command
+\end{enumerate}
+
+These commands are currently only defined for the SR-IOV group type.
+
+\paragraph{Device Mode Get Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Mode Get Command}
+
+This command reads the mode of the device.
+For the command VIRTIO_ADMIN_CMD_DEV_MODE_GET, \field{opcode}
+is set to 0x7.
+The \field{group_member_id} refers to the member device to be accessed.
+This command has no command specific data.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_mode_get_result {
+        u8 mode;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_mode_get_result}
+returned by the device, where the device sets the \field{mode} value to
+either \field{Active} or \field{Stop} or \field{Freeze}.
+
+\paragraph{Device Mode Set Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Mode Set Command}
+
+This command sets the mode of the device.
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_mode_set_data} describing the new device mode.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_mode_set_data {
+        u8 mode;
+};
+\end{lstlisting}
+
+For the command VIRTIO_ADMIN_CMD_DEV_MODE_SET, \field{opcode} is set to 0x8.
+The \field{group_member_id} refers to the member device to be accessed.
+The \field{mode} is set to either \field{Active} or \field{Stop} or
+\field{Freeze}.
+
+This command has no command specific result. When the command completes
+successfully, the device is set to the new \field{mode}. When the command fails,
+the device stays in the previous mode.
+
+\paragraph{Device Context Size Get Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Size Get Command}
+
+This command returns the remaining estimated device context size. The 
+driver can query the remaining estimated device context size
+for the current mode or for the \field{Freeze} mode. While
+reading the device context using the VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the
+actual device context size may differ from what is returned by
+this command. After reading the device context using the
+VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the remaining estimated context size
+usually reduces by the amount of device context read by the driver using
+the VIRTIO_ADMIN_CMD_DEV_CTX_READ command. If the device context is updated
+rapidly, the remaining estimated context size may also increase even after
+reading the device context using the VIRTIO_ADMIN_CMD_DEV_CTX_READ command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET, \field{opcode} is set to 0x9.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_size_get_data {
+        u8 freeze_mode;
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_ctx_size_get_data}.
+When \field{freeze_mode} is set to 1, the device returns the estimated
+device context size when the device will be in \field{Freeze} mode.
+As the device context is read from the device, the remaining estimated
+context size may decrease. For example, suppose the member device mode is
+\field{Stop} and the device estimates the total device context size
+as 12KB; the device returns 12KB for the first
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command. Once the driver has
+already read 8KB of device context data using the
+VIRTIO_ADMIN_CMD_DEV_CTX_READ command and the remaining data is
+4KB, the device returns 4KB in the subsequent
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_size_get_result {
+        le64 size;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result} is in
+the format \field{struct virtio_admin_cmd_dev_ctx_size_get_result}.
+
+Once the device context is fully read, this command returns zero for
+\field{size} until a new device context is generated.
+
+\paragraph{Device Context Read Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Read Command}
+
+This command reads the current device context.
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_READ, \field{opcode} is set to 0xa.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command has no command specific data.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_rd_len {
+        le32 context_len;
+};
+
+struct virtio_admin_cmd_dev_ctx_rd_result {
+        u8 data[];
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_ctx_rd_result}
+returned by the device containing the device context data and
+\field{command_specific_output} is in format of
+\field{struct virtio_admin_cmd_dev_ctx_rd_len} containing the length of the
+context data returned by the device in the command response. When the length
+returned is zero, the device does not have any device context data left to
+report; at this point the device context stream ends.
+
+The driver can read the whole device context data using one or multiple
+commands. When the device context does not fit in the
+\field{command_specific_result}, the driver reads the remaining
+bytes using one or more subsequent commands.
+
+\paragraph{Device Context Write Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Write Command}
+
+This command writes the device context data. The device context can be written
+only when the device mode is \field{Freeze}.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, \field{opcode}
+is set to 0xb.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_wr_data {
+        u8 data[];
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_ctx_wr_data} describing
+the access to be performed.
+
+This command has no command specific result.
+The device fails the command when it is executed while the device mode
+is other than \field{Freeze}.
+
+The written device context is effective when the device mode is changed
+from \field{Freeze} to \field{Stop} or from \field{Freeze} to \field{Active}.
+
+The driver can write the whole device context using one or multiple
+commands. When the device context does not fit in one command data
+\field{struct virtio_admin_cmd_dev_ctx_wr_data}, the driver writes the
+remaining bytes using one or more subsequent commands.
+
+\paragraph{Device Context Supported Fields Query Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Supported Fields Query Command}
+
+This command reads the supported fields of the device context.
+Each listed \field{type} of the device context in
+\ref{sec:Basic Facilities of a Virtio Device / Device Context} is represented
+as one entry in the command response. When the device supports a given \field{type} for the member
+device, the corresponding entry is set in the command response.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_FIELDS_QUERY, \field{opcode} is set to 0xc.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command has no command specific data.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_supported_field {
+        le16 type;
+        u8 reserved[6];
+        le64 length;
+};
+
+struct virtio_admin_cmd_dev_ctx_supported_fields_result {
+        struct virtio_admin_cmd_dev_ctx_supported_field fields[];
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_ctx_supported_fields_result}.
+Each entry in array \field{fields} represents the supported \field{type}
+and its length as described in \ref{tab:Device Context Fields}.
+
+\paragraph{Device Context Discard Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Discard Command}
+
+This command discards any partial device context that is yet to be read
+by the driver and it also discards any device context that is partially written.
+This command can be used by the driver to abort any device context migration
+flow in which partial context read or write operations may
+have occurred.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD, \field{opcode}
+is set to 0xd.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command has no command specific data.
+This command has no command specific result.
+
+Once this command completes successfully, the device context is
+discarded. If the device context that is discarded was part of the write
+operation, once this command completes, the device functions as if the device
+context was never written. If the device context that is discarded was part
+of the read operation, once this command completes, the device functions as if
+the device context was never read in the given device mode. Once the device
+context is discarded, in a subsequent VIRTIO_ADMIN_CMD_DEV_CTX_READ command,
+the device returns a new device context. Once the device context is
+discarded, a subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
+context.
diff --git a/admin.tex b/admin.tex
index c86813d..142692c 100644
--- a/admin.tex
+++ b/admin.tex
@@ -126,7 +126,21 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
 \hline
 0x0006 & VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO & Query the notification region information \\
 \hline
-0x0007 - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
+0x0007 & VIRTIO_ADMIN_CMD_DEV_MODE_GET & Query the device mode \\
+\hline
+0x0008 & VIRTIO_ADMIN_CMD_DEV_MODE_SET & Set the device mode \\
+\hline
+0x0009 & VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET & Query the device context size \\
+\hline
+0x000a & VIRTIO_ADMIN_CMD_DEV_CTX_READ & Read the device context data \\
+\hline
+0x000b & VIRTIO_ADMIN_CMD_DEV_CTX_WRITE & Write the device context data \\
+\hline
+0x000c & VIRTIO_ADMIN_CMD_DEV_CTX_FIELDS_QUERY & Query supported fields of the device context \\
+\hline
+0x000d & VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD & Discard the device context data \\
+\hline
+0x000e - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
 \hline
 0x8000 - 0xFFFF & - & Reserved for future commands (possibly using a different structure)    \\
 \hline
diff --git a/device-context.tex b/device-context.tex
index 06ed43d..1eb71f7 100644
--- a/device-context.tex
+++ b/device-context.tex
@@ -133,9 +133,9 @@ \subsubsection{PCI Virtqueue Configuration Context}
         le16 queue_size;
         le16 queue_msix_vector;
         le16 reserved;
-        le64 queue_desc;
-        le64 queue_driver;
-        le64 queue_device;
+        le64 queue_desc;
+        le64 queue_driver;
+        le64 queue_device;
 };
 \end{lstlisting}
 
-- 
2.34.1



* [virtio-comment] [PATCH v3 5/8] admin: Add requirements of device migration commands
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

Add device and driver side requirements for the device migration
commands.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
changelog:
v2->v3:
- added device and driver normatives for {_START,_END}_MARKER fields
- wrote member VF device instead of VF device
v1->v2:
- fixed spelling from membe to member
- removed device requirement line of FLR making the device active
  as it was incorrectly written to mix operational and admin state
- added requirements to clarify flr, device reset, pm and admin commands
- group sriov requirements
- added description for device config space access in stop mode
- removed stale requirement around pci ids
- made device context write command requirements more robust
  for future and backward compatibility
---
 admin-cmds-device-migration.tex | 173 ++++++++++++++++++++++++++++++++
 1 file changed, 173 insertions(+)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index c5030d2..ed911e4 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -323,3 +323,176 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 the device returns a new device context. Once the device context is
 discarded, a subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
 context.
+
+\devicenormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
+
+A device MUST either support all of, or none of
+VIRTIO_ADMIN_CMD_DEV_MODE_GET,
+VIRTIO_ADMIN_CMD_DEV_MODE_SET,
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET,
+VIRTIO_ADMIN_CMD_DEV_CTX_READ,
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE and
+VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD commands.
+
+When the device \field{mode} supplied in the command
+VIRTIO_ADMIN_CMD_DEV_MODE_SET is the same as the current mode of the device, the device
+MUST complete the command successfully.
+
+The device MUST fail the command VIRTIO_ADMIN_CMD_DEV_MODE_SET when the \field{mode}
+is other than \field{Active} or \field{Stop} or \field{Freeze}.
+
+When changing the device mode using the command VIRTIO_ADMIN_CMD_DEV_MODE_SET,
+if the command fails, the device MUST retain the current device mode.
+
+The device MUST fail VIRTIO_ADMIN_CMD_DEV_MODE_SET command when \field{mode}
+is set to \field{Active} or \field{Stop} and if the device context is
+partially read or written using VIRTIO_ADMIN_CMD_DEV_CTX_READ and
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE commands respectively.
+
+When the VIRTIO_ADMIN_CMD_DEV_CTX_READ command is received multiple times
+in a given mode, and the complete device context has already been read by the
+driver, on subsequent reception of the VIRTIO_ADMIN_CMD_DEV_CTX_READ command,
+the device MUST complete the command successfully with
+\field{context_len} set to zero.
+
+The device MUST support reading the device context when the device is
+in any of the \field{Active}, \field{Stop} or \field{Freeze} modes using the
+VIRTIO_ADMIN_CMD_DEV_CTX_READ command.
+
+When the device is in any of the modes, and the device context is read
+partially using the VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the device MUST discard
+the device context when the VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD command is executed;
+in subsequent executions of VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET and
+VIRTIO_ADMIN_CMD_DEV_CTX_READ, the device MUST return the remaining
+estimated device context size and the device context respectively for the
+current mode as if VIRTIO_ADMIN_CMD_DEV_CTX_READ was never received by the
+device for the current device mode.
+
+The device MUST set VIRTIO_DEV_CTX_START_MARKER, VIRTIO_DEV_CTX_END_MARKER
+and VIRTIO_DEV_CTX_DISCARD in VIRTIO_ADMIN_CMD_DEV_CTX_FIELDS_QUERY
+command result.
+
+The device MUST support writing the complete device context multiple times
+by the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE.
+
+The device MUST fail VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command when the device
+mode is not \field{Freeze}.
+
+For the SR-IOV group type,
+\begin{itemize}
+\item the device MUST NOT initiate any PCI transaction
+      when the device mode is not \field{Active}.
+\item the device MUST finish all the outstanding PCI transactions before completing
+      the command VIRTIO_ADMIN_CMD_DEV_MODE_SET.
+\item when the device mode is \field{Stop}, the device MUST accept driver
+       notifications and the device MAY update any fields of the device context.
+\item the device MUST respond with valid values for PCI read requests when
+      the device mode is \field{Stop}.
+\item the device MUST function the same for the PCI architected interfaces
+      regardless of the device mode.
+\item the device MUST NOT generate any PCI PME when the device is
+      not in the \field{Active} mode.
+\item the device MUST NOT update any fields of the device context when the
+      device is in \field{Freeze} mode, the device MAY update fields of the
+      device context when the device transitions from \field{Stop} to
+      \field{Freeze} mode.
+\end{itemize}
+
+When the device mode is not \field{Active},
+\begin{itemize}
+\item the device MUST NOT access any virtqueue memory or any memory referenced
+      by the virtqueue.
+
+\item the device MUST NOT generate any configuration change notification.
+\end{itemize}
+
+When the device is in \field{Freeze} mode, and if any device context is
+written partially by VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, the device MUST discard
+the device context when VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD
+command is executed, i.e. the device functions as if the command
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE was never received.
+
+For the SR-IOV group type,
+\begin{itemize}
+\item when the device is in \field{Freeze} mode, any
+write access to the virtio configuration space MUST NOT update any fields and any
+configuration space read MAY return any value.
+
+\item for the VIRTIO_PCI_CAP_PCI_CFG capability area,
+the device MUST ignore writes when the device mode is set to \field{Freeze}
+and on receiving the reads, the device MUST function same regardless of the
+device mode is \field{Active} or \field{Stop} or \field{Freeze}.
+
+\item the member VF device MUST respond to commands
+VIRTIO_ADMIN_CMD_DEV_MODE_SET, VIRTIO_ADMIN_CMD_DEV_CTX_WRITE and
+VIRTIO_ADMIN_CMD_DEV_CTX_READ after the member VF device FLR completes in the
+device, if the member VF device FLR is in progress when the device receives
+any of these commands.
+
+\item the member device MUST respond to commands
+VIRTIO_ADMIN_CMD_DEV_MODE_SET, VIRTIO_ADMIN_CMD_DEV_CTX_WRITE and
+VIRTIO_ADMIN_CMD_DEV_CTX_READ after the device reset completes in the device, if the
+device reset is in progress when the device receives any of these commands.
+
+\item the member device MUST respond to commands
+VIRTIO_ADMIN_CMD_DEV_MODE_SET, VIRTIO_ADMIN_CMD_DEV_CTX_WRITE and
+VIRTIO_ADMIN_CMD_DEV_CTX_READ after the device power management state transition completes
+in the device, if the power management state transition is in progress
+when the device receives any of these commands.
+\end{itemize}
+
+The device MUST respond with an error for the command
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, if there is a mismatch between the
+device context field length supplied in the
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE data and the length of the field
+in the device.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_READ,
+\begin{itemize}
+\item when there is a valid device context to respond with, the device MUST set
+\field{fields[0]} in \field{struct virtio_dev_ctx} to type
+VIRTIO_DEV_CTX_START_MARKER at the start.
+
+\item the device MUST end the device context \field{struct virtio_dev_ctx}
+with the last entry of \field{fields[]} having type VIRTIO_DEV_CTX_END_MARKER.
+
+\item the device MUST NOT respond with VIRTIO_DEV_CTX_END_MARKER followed by
+VIRTIO_DEV_CTX_START_MARKER in a single command response.
+
+\item the device MAY respond with VIRTIO_DEV_CTX_START_MARKER followed by
+VIRTIO_DEV_CTX_END_MARKER in two different command responses.
+
+\item the device MAY respond with VIRTIO_DEV_CTX_START_MARKER followed by
+VIRTIO_DEV_CTX_END_MARKER in a single command response.
+\end{itemize}
+
+\drivernormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
+
+The driver SHOULD read the complete device context using one or multiple
+VIRTIO_ADMIN_CMD_DEV_CTX_READ commands.
+
+The driver MAY write the device context before changing the device mode from
+\field{Freeze} to \field{Stop} or from \field{Freeze} to \field{Active};
+when doing so, the driver MUST write a complete device context using one or
+multiple VIRTIO_ADMIN_CMD_DEV_CTX_WRITE commands.
+
+The driver MUST NOT change the device mode to \field{Stop} or \field{Active}
+in the command VIRTIO_ADMIN_CMD_DEV_MODE_SET when device context is
+partially written.
+
+For the SR-IOV group type, the driver SHOULD NOT access device configuration
+space described in section
+\ref{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
+when the device mode is set to \field{Freeze} or \field{Stop}.
+
+For the SR-IOV group type, the driver MUST NOT write into the
+VIRTIO_PCI_CAP_PCI_CFG capability area when the device mode is set to
+\field{Freeze}.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, at the start, the driver MUST
+set \field{fields[0]} in \field{struct virtio_dev_ctx} to type
+VIRTIO_DEV_CTX_START_MARKER.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, the driver MUST
+end the device context \field{struct virtio_dev_ctx} with the last
+entry of \field{fields[]} having type VIRTIO_DEV_CTX_END_MARKER.
-- 
2.34.1
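
To make the ordering constraints above concrete, the following is a minimal
sketch (in C) of one possible owner-driver save/restore flow. The helper
functions (admin_mode_set(), admin_ctx_read(), admin_ctx_write(),
admin_ctx_discard()), the mode constants and the single-buffer handling are
illustrative assumptions, not interfaces defined by this series.

#include <stddef.h>

/*
 * Hypothetical helpers; each one issues a single group administration
 * command to the member device identified by member_id and returns 0
 * (or a byte count) on success, negative on failure.
 */
extern int admin_mode_set(int member_id, int mode);
extern long admin_ctx_read(int member_id, void *buf, size_t len);
extern int admin_ctx_write(int member_id, const void *buf, size_t len);
extern int admin_ctx_discard(int member_id);

enum { MODE_ACTIVE, MODE_STOP, MODE_FREEZE };  /* illustrative values only */

/* Source side: stop the device, then drain its device context. */
int save_device_context(int member_id, void *buf, size_t buf_len, size_t *out_len)
{
        size_t total = 0;
        long n;

        if (admin_mode_set(member_id, MODE_STOP) ||
            admin_mode_set(member_id, MODE_FREEZE))
                return -1;

        /* Read until the device returns context_len == 0, which means the
         * complete device context has been read for this mode. */
        while ((n = admin_ctx_read(member_id, (char *)buf + total,
                                   buf_len - total)) > 0)
                total += n;
        if (n < 0) {
                /* A partially read context must be discarded before the
                 * mode may be changed back to Stop or Active. */
                admin_ctx_discard(member_id);
                return -1;
        }
        *out_len = total;
        return 0;
}

/* Destination side: write a complete context while frozen, then resume. */
int restore_device_context(int member_id, const void *buf, size_t len)
{
        if (admin_mode_set(member_id, MODE_FREEZE))
                return -1;
        /*
         * CTX_WRITE is only accepted in Freeze mode; the written stream is
         * expected to start with VIRTIO_DEV_CTX_START_MARKER and end with
         * VIRTIO_DEV_CTX_END_MARKER.
         */
        if (admin_ctx_write(member_id, buf, len)) {
                admin_ctx_discard(member_id);
                return -1;
        }
        return admin_mode_set(member_id, MODE_ACTIVE);
}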



* [virtio-comment] [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-10-30 13:19 [virtio-comment] [PATCH v3 0/8] Introduce device migration support commands Parav Pandit
                   ` (4 preceding siblings ...)
  2023-10-30 13:19 ` [virtio-comment] [PATCH v3 5/8] admin: Add requirements of device migration commands Parav Pandit
@ 2023-10-30 13:19 ` Parav Pandit
  2023-10-31  1:43   ` [virtio-comment] " Jason Wang
  2023-10-30 13:19 ` [virtio-comment] [PATCH v3 7/8] admin: Add " Parav Pandit
  2023-10-30 13:19 ` [virtio-comment] [PATCH v3 8/8] admin: Add requirements of write reporting commands Parav Pandit
  7 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

During a device migration flow (typically in the precopy phase of a
live migration), a device may write to the guest memory. Some
IOMMUs/hypervisors may not be able to track these written pages.
These pages need to be migrated from the source to the destination
hypervisor.

A device which writes to these pages provides a record of the written
page addresses to the owner device. The owner driver starts write
recording for the device and queries all the page addresses written by
the device.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Satananda Burla <sburla@marvell.com>
---
changelog:
v1->v2:
- addressed comments from Michael
- replaced iova with physical address
---
 admin-cmds-device-migration.tex | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index ed911e4..2e32f2c 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -95,6 +95,21 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 The owner driver can discard any partially read or written device context when
 any of the device migration flow should be aborted.
 
+During the device migration flow, a passthrough device may write data to the
+guest virtual machine's memory; the source hypervisor needs to keep track of
+this written memory in order to migrate it to the destination hypervisor.
+Some systems may not be able to keep track of such memory write addresses at
+the hypervisor level. In such a scenario, a device records and reports these
+written memory addresses to the owner device. The owner driver enables write
+recording for one or more physical address ranges per device during the device
+migration flow. The owner driver periodically queries these written physical
+address records from the device. As the driver reads the written address records,
+the device clears those records from the device.
+Once the device reports zero or small number of written address records, the device
+mode is set to \field{Stop} or \field{Freeze}. Once the device is set to \field{Stop}
+or \field{Freeze} mode, and once all the IOVA records are read, the driver stops
+the write recording in the device.
+
 The owner driver uses following device migration group administration commands.
 
 \begin{enumerate}
-- 
2.34.1
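
The flow described in this patch can be summarized as a small driver-side
loop. The following C sketch is purely illustrative: the helper names
(write_records_start(), write_records_read(), write_records_stop(),
mode_set_stop(), resend_page()) and the convergence threshold are
assumptions, not part of the proposed commands, and error handling is
reduced to the minimum.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrappers around the group administration commands. */
extern int  write_records_start(int member_id);                /* start recording */
extern long write_records_read(int member_id, uint64_t *pages, /* read and clear  */
                               size_t max);                    /* returns count   */
extern int  write_records_stop(int member_id);                 /* stop recording  */
extern int  mode_set_stop(int member_id);                      /* mode -> Stop    */
extern void resend_page(uint64_t page_addr);                   /* migrate a page  */

#define DIRTY_THRESHOLD 64  /* illustrative convergence threshold */

int precopy_track_writes(int member_id)
{
        uint64_t pages[256];
        long n;

        if (write_records_start(member_id))
                return -1;

        /* Precopy loop: reading the records clears them in the device, so
         * each iteration re-sends only the pages written since last time. */
        for (;;) {
                n = write_records_read(member_id, pages, 256);
                if (n < 0)
                        goto err;
                for (long i = 0; i < n; i++)
                        resend_page(pages[i]);
                if (n < DIRTY_THRESHOLD)
                        break;
        }

        /* Few records remain: stop the device, drain the final records,
         * then stop write recording. */
        if (mode_set_stop(member_id))
                goto err;
        while ((n = write_records_read(member_id, pages, 256)) > 0)
                for (long i = 0; i < n; i++)
                        resend_page(pages[i]);

        return write_records_stop(member_id);
err:
        write_records_stop(member_id);
        return -1;
}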



* [virtio-comment] [PATCH v3 7/8] admin: Add write recording commands
  2023-10-30 13:19 [virtio-comment] [PATCH v3 0/8] Introduce device migration support commands Parav Pandit
                   ` (5 preceding siblings ...)
  2023-10-30 13:19 ` [virtio-comment] [PATCH v3 6/8] admin: Add theory of operation for write recording commands Parav Pandit
@ 2023-10-30 13:19 ` Parav Pandit
  2023-10-30 13:19 ` [virtio-comment] [PATCH v3 8/8] admin: Add requirements of write reporting commands Parav Pandit
  7 siblings, 0 replies; 157+ messages in thread
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

When migrating a virtual machine with passthrough
virtio devices, the virtio device may write into the guest
memory. Some systems may not be able to keep track of these
pages efficiently.

To facilitate such systems, the device provides a record of the
pages which are written by the device.

The owner driver configures the member device with a list of address
ranges for which it expects write recording and reporting by the device.

The owner driver periodically queries the written page address records,
which are cleared from the device upon reading.

When the number of write records reduces over time, write recording
is eventually stopped after the device mode is set to FREEZE.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Satananda Burla <sburla@marvell.com>
---
changelog:
v1->v2:
- addressed comments from Michael
- merged theory of operation changes to previous patch
- replaced iova with physical address
- renamed iova range with a page
- reworded and simplified wording using page
---
 admin-cmds-device-migration.tex | 129 +++++++++++++++++++++++++++++++-
 admin.tex                       |  10 ++-
 2 files changed, 135 insertions(+), 4 deletions(-)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index 2e32f2c..f6c2881 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -106,9 +106,8 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 address records from the device. As the driver reads the written address records,
 the device clears those records from the device.
 Once the device reports zero or small number of written address records, the device
-mode is set to \field{Stop} or \field{Freeze}. Once the device is set to \field{Stop}
-or \field{Freeze} mode, and once all the IOVA records are read, the driver stops
-the write recording in the device.
+mode is set to \field{Stop} or \field{Freeze}. Once all the physical address records
+are read, the driver stops the write recording in the device.
 
 The owner driver uses following device migration group administration commands.
 
@@ -120,6 +119,9 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 \item Device Context Write Command
 \item Device Context Supported Fields Query Command
 \item Device Context Discard Command
+\item Device Write Records Start Command
+\item Device Write Records Stop Command
+\item Device Write Records Read Command
 \end{enumerate}
 
 These commands are currently only defined for the SR-IOV group type.
@@ -339,6 +341,127 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
 context.
 
+\paragraph{Device Write Record Capabilities Query Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Record Capabilities Query Command}
+
+This command reads the device write record capabilities.
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY, \field{opcode}
+is set to 0xd.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_write_record_cap_result {
+        le32 supported_page_size_bitmap;
+        le32 supported_ranges;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_write_record_cap_result}
+returned by the device. The \field{supported_page_size_bitmap} indicates
+the page size granularities at which the device can record writes.
+The minimum page size granularity is 4KB. Each bit represents a
+supported page size: bit 0 corresponds to 4KB, bit 1 corresponds to 8KB,
+and bit 31 corresponds to 8TB. The device supports one or more page sizes.
+For each supported page size, the device sets the corresponding bit in
+\field{supported_page_size_bitmap}. The \field{supported_ranges} field
+indicates the number of unique (non-overlapping) physical address ranges, in
+page granularity, that can be recorded by the device.
+
+\paragraph{Device Write Records Start Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Start Command}
+
+This command starts the write recording in the device for the specified
+physical address ranges.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START, \field{opcode}
+is set to 0xe.
+The \field{group_member_id} refers to the member device to be accessed.
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_write_record_start_data}.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_write_record_start_entry {
+        le64 page_address;
+        le64 page_count;
+};
+
+struct virtio_admin_cmd_write_record_start_data {
+        le64 page_size;
+        le32 count;
+        u8 reserved[4];
+        struct virtio_admin_cmd_write_record_start_entry entries[];
+};
+
+\end{lstlisting}
+
+The \field{count} is set to indicate the number of valid \field{entries}.
+The \field{page_address} indicates the start physical address.
+The \field{page_count} indicates the number of pages of size \field{page_size},
+starting from \field{page_address}, to record. All the \field{entries}
+are unique, non-overlapping page ranges.
+Whenever the device writes to memory in the supplied address range, the
+device records the physical address of the page in which the write
+occurred. These write records can be read by the driver using the
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
+
+This command has no command specific result.
+
+\paragraph{Device Write Record Stop Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Record Stop Command}
+
+This command stops the write recording in the device which was
+previously started using the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP, \field{opcode}
+is set to 0xf.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command does not have any command specific data.
+This command has no command specific result.
+
+\paragraph{Device Write Records Read Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Read Command}
+
+This command reads the device write records for which write recording was
+previously started using the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ, \field{opcode}
+is set to 0x10.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_write_records_read_data {
+        le64 page_address;
+        le64 length;
+};
+
+struct virtio_admin_cmd_dev_write_records_cnt {
+        le32 count;
+};
+
+struct virtio_admin_cmd_dev_write_records_result {
+        le64 address_entries[];
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_write_records_read_data}. The driver
+sets \field{page_address} indicating the start page address for up to
+\field{length} bytes. The supplied physical address range can be the
+same as or smaller than the range supplied when write recording was started by
+the driver in the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command. The \field{length}
+must be equal to or a multiple of one of the page sizes reported by the device in
+\field{supported_page_size_bitmap}.
+
+When the command completes successfully, \field{command_specific_result} is in
+the format of \field{struct virtio_admin_cmd_dev_write_records_cnt}, containing the
+number of write records returned by the device, followed by
+\field{struct virtio_admin_cmd_dev_write_records_result} containing the write records.
+When the command completes successfully, the write records which are returned
+in the result are cleared from the device.
+
 \devicenormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
 
 A device MUST either support all of, or none of
diff --git a/admin.tex b/admin.tex
index 142692c..41cabfe 100644
--- a/admin.tex
+++ b/admin.tex
@@ -140,7 +140,15 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
 \hline
 0x000d & VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD & Clear the device context data \\
 \hline
-0x000e - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
+0x000f & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY & Query Write recording capabilities \\
+\hline
+0x0010 & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START & Start Write recording in the device \\
+\hline
+0x0011 & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP & Stop write recording in the device \\
+\hline
+0x0012 & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ & Read and clear write records from the device \\
+\hline
+0x0013 - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
 \hline
 0x8000 - 0xFFFF & - & Reserved for future commands (possibly using a different structure)    \\
 \hline
-- 
2.34.1
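
To show how the structures in this patch fit together, here is a small
illustrative C snippet that picks the smallest page size advertised in
supported_page_size_bitmap and builds a single-entry payload for
VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START. The struct layouts follow the
patch (with uintN_t types standing in for leN and the little-endian
conversion omitted); build_start_payload() and
smallest_supported_page_size() are hypothetical helpers.

#include <stdint.h>
#include <stdlib.h>

struct virtio_admin_cmd_write_record_start_entry {
        uint64_t page_address;
        uint64_t page_count;
};

struct virtio_admin_cmd_write_record_start_data {
        uint64_t page_size;
        uint32_t count;
        uint8_t  reserved[4];
        struct virtio_admin_cmd_write_record_start_entry entries[];
};

/* Bit 0 of the bitmap means 4KB, bit 1 means 8KB, and so on. */
static uint64_t smallest_supported_page_size(uint32_t bitmap)
{
        for (int bit = 0; bit < 32; bit++)
                if (bitmap & (1u << bit))
                        return 4096ull << bit;
        return 0;
}

/* Build a one-entry start payload covering [base, base + len). */
static struct virtio_admin_cmd_write_record_start_data *
build_start_payload(uint32_t supported_page_size_bitmap,
                    uint64_t base, uint64_t len)
{
        uint64_t psize = smallest_supported_page_size(supported_page_size_bitmap);
        struct virtio_admin_cmd_write_record_start_data *d;

        if (!psize || (base % psize) || (len % psize))
                return NULL;

        d = calloc(1, sizeof(*d) + sizeof(d->entries[0]));
        if (!d)
                return NULL;
        d->page_size = psize;
        d->count = 1;
        d->entries[0].page_address = base;
        d->entries[0].page_count = len / psize;
        return d;
}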



* [virtio-comment] [PATCH v3 8/8] admin: Add requirements of write reporting commands
  2023-10-30 13:19 [virtio-comment] [PATCH v3 0/8] Introduce device migration support commands Parav Pandit
                   ` (6 preceding siblings ...)
  2023-10-30 13:19 ` [virtio-comment] [PATCH v3 7/8] admin: Add " Parav Pandit
@ 2023-10-30 13:19 ` Parav Pandit
  7 siblings, 0 replies; 157+ messages in thread
From: Parav Pandit @ 2023-10-30 13:19 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck
  Cc: sburla, shahafs, maorg, yishaih, lingshan.zhu, jasowang, Parav Pandit

Add device and driver requirements for the write reporting commands.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
changelog:
- addressed comments from Michael
- renamed iova range to a page
- removed duplicate device requirement
- allow stopping write recording multiple times even if it is stopped
  so migration driver can start cleanly at beginning
---
 admin-cmds-device-migration.tex | 36 +++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index f6c2881..59b946a 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -604,6 +604,34 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 VIRTIO_DEV_CTX_END_MARKER in a single command response.
 \end{itemize}
 
+A device MUST either support all of, or none of
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START,
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP and
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ commands.
+
+If the device supports the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY
+command, the device MUST set at least one bit in
+\field{supported_page_size_bitmap} and set a non-zero value in
+\field{supported_ranges}.
+
+The device MUST fail the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command
+if write recording has not been started by the driver.
+
+The device MUST complete the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP command
+successfully, even if write recording has not been started by the driver
+or has already been stopped previously.
+
+For the SR-IOV group type, for the VF member device, VF function level
+reset (FLR) MUST NOT stop write recording on the VF device and it MUST NOT
+clear any write records already gathered by the owner device.
+
+The device MUST clear the write records which are returned in the
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ result. After completion
+of VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ, if a new write record is created
+for the same page, the device MUST report such a write record as a
+new entry.
+
 \drivernormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
 
 The driver SHOULD read the complete device context using one or multiple
@@ -634,3 +662,11 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 For the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, the driver MUST
 end the device context \field{struct virtio_dev_ctx} with the last
 entry of \field{fields[]} having type VIRTIO_DEV_CTX_END_MARKER.
+
+The driver MUST NOT invoke VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
+for overlapping page ranges; each page range supplied in the command
+MUST be unique.
+
+If write recording is started by the driver using the
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command, the driver MUST explicitly
+stop the write recording using the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP command.
-- 
2.34.1
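
One way for a driver to honor the non-overlapping page range requirement in
this patch is to validate the ranges before issuing
VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START. The following C helper is an
illustrative sketch; the struct name, field names and the sort-and-scan
approach are assumptions, not part of the proposal.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct range {
        uint64_t page_address;  /* start address, in bytes                */
        uint64_t page_count;    /* number of pages of size page_size each */
};

static int cmp_range(const void *a, const void *b)
{
        const struct range *ra = a, *rb = b;

        if (ra->page_address < rb->page_address)
                return -1;
        return ra->page_address > rb->page_address;
}

/* Return true if no two ranges overlap; the array is sorted in place. */
static bool ranges_are_unique(struct range *r, size_t n, uint64_t page_size)
{
        qsort(r, n, sizeof(*r), cmp_range);
        for (size_t i = 1; i < n; i++) {
                uint64_t prev_end = r[i - 1].page_address +
                                    r[i - 1].page_count * page_size;
                if (r[i].page_address < prev_end)
                        return false;
        }
        return true;
}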



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-10-30 13:19 ` [virtio-comment] [PATCH v3 6/8] admin: Add theory of operation for write recording commands Parav Pandit
@ 2023-10-31  1:43   ` Jason Wang
  2023-10-31  3:27     ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-10-31  1:43 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, shahafs, maorg, yishaih,
	lingshan.zhu

On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
>
> During a device migration flow (typically in a precopy phase of the
> live migration), a device may write to the guest memory. Some
> iommu/hypervisor may not be able to track these written pages.
> These pages to be migrated from source to destination hypervisor.
>
> A device which writes to these pages, provides the page address record
> of the to the owner device. The owner device starts write
> recording for the device and queries all the page addresses written by
> the device.
>
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Signed-off-by: Satananda Burla <sburla@marvell.com>
> ---
> changelog:
> v1->v2:
> - addressed comments from Michael
> - replaced iova with physical address
> ---
>  admin-cmds-device-migration.tex | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
> index ed911e4..2e32f2c 100644
> --- a/admin-cmds-device-migration.tex
> +++ b/admin-cmds-device-migration.tex
> @@ -95,6 +95,21 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
>  The owner driver can discard any partially read or written device context when
>  any of the device migration flow should be aborted.
>
> +During the device migration flow, a passthrough device may write data to the
> +guest virtual machine's memory, a source hypervisor needs to keep track of these
> +written memory to migrate such memory to destination hypervisor.
> +Some systems may not be able to keep track of such memory write addresses at
> +hypervisor level. In such a scenario, a device records and reports these
> +written memory addresses to the owner device. The owner driver enables write
> +recording for one or more physical address ranges per device during device
> +migration flow. The owner driver periodically queries these written physical
> +address records from the device.

I wonder how PA works in this case. Device uses untranslated requests
so it can only see IOVA. We can't mandate ATS anyhow.

Thanks



* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-10-31  1:43   ` [virtio-comment] " Jason Wang
@ 2023-10-31  3:27     ` Parav Pandit
  2023-10-31  7:45       ` [virtio-comment] " Michael S. Tsirkin
  2023-11-01  0:29       ` Jason Wang
  0 siblings, 2 replies; 157+ messages in thread
From: Parav Pandit @ 2023-10-31  3:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 31, 2023 7:13 AM
> 
> On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > During a device migration flow (typically in a precopy phase of the
> > live migration), a device may write to the guest memory. Some
> > iommu/hypervisor may not be able to track these written pages.
> > These pages to be migrated from source to destination hypervisor.
> >
> > A device which writes to these pages, provides the page address record
> > of the to the owner device. The owner device starts write recording
> > for the device and queries all the page addresses written by the
> > device.
> >
> > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > ---
> > changelog:
> > v1->v2:
> > - addressed comments from Michael
> > - replaced iova with physical address
> > ---
> >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/admin-cmds-device-migration.tex
> > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c 100644
> > --- a/admin-cmds-device-migration.tex
> > +++ b/admin-cmds-device-migration.tex
> > @@ -95,6 +95,21 @@ \subsubsection{Device Migration}\label{sec:Basic
> > Facilities of a Virtio Device /  The owner driver can discard any
> > partially read or written device context when  any of the device migration flow
> should be aborted.
> >
> > +During the device migration flow, a passthrough device may write data
> > +to the guest virtual machine's memory, a source hypervisor needs to
> > +keep track of these written memory to migrate such memory to destination
> hypervisor.
> > +Some systems may not be able to keep track of such memory write
> > +addresses at hypervisor level. In such a scenario, a device records
> > +and reports these written memory addresses to the owner device. The
> > +owner driver enables write recording for one or more physical address
> > +ranges per device during device migration flow. The owner driver
> > +periodically queries these written physical address records from the device.
> 
> I wonder how PA works in this case. Device uses untranslated requests so it can
> only see IOVA. We can't mandate ATS anyhow.
Michael suggested keeping the language uniform as PA, since this is ultimately what the guest driver supplies as a physical address during vq creation and when posting buffers.


* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-10-31  3:27     ` [virtio-comment] " Parav Pandit
@ 2023-10-31  7:45       ` Michael S. Tsirkin
  2023-10-31  9:32         ` Zhu, Lingshan
  2023-11-01  0:29       ` Jason Wang
  1 sibling, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-10-31  7:45 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Tue, Oct 31, 2023 at 03:27:12AM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 31, 2023 7:13 AM
> > 
> > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > During a device migration flow (typically in a precopy phase of the
> > > live migration), a device may write to the guest memory. Some
> > > iommu/hypervisor may not be able to track these written pages.
> > > These pages to be migrated from source to destination hypervisor.
> > >
> > > A device which writes to these pages, provides the page address record
> > > of the to the owner device. The owner device starts write recording
> > > for the device and queries all the page addresses written by the
> > > device.
> > >
> > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > ---
> > > changelog:
> > > v1->v2:
> > > - addressed comments from Michael
> > > - replaced iova with physical address
> > > ---
> > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > >  1 file changed, 15 insertions(+)
> > >
> > > diff --git a/admin-cmds-device-migration.tex
> > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c 100644
> > > --- a/admin-cmds-device-migration.tex
> > > +++ b/admin-cmds-device-migration.tex
> > > @@ -95,6 +95,21 @@ \subsubsection{Device Migration}\label{sec:Basic
> > > Facilities of a Virtio Device /  The owner driver can discard any
> > > partially read or written device context when  any of the device migration flow
> > should be aborted.
> > >
> > > +During the device migration flow, a passthrough device may write data
> > > +to the guest virtual machine's memory, a source hypervisor needs to
> > > +keep track of these written memory to migrate such memory to destination
> > hypervisor.
> > > +Some systems may not be able to keep track of such memory write
> > > +addresses at hypervisor level. In such a scenario, a device records
> > > +and reports these written memory addresses to the owner device. The
> > > +owner driver enables write recording for one or more physical address
> > > +ranges per device during device migration flow. The owner driver
> > > +periodically queries these written physical address records from the device.
> > 
> > I wonder how PA works in this case. Device uses untranslated requests so it can
> > only see IOVA. We can't mandate ATS anyhow.
> Michael suggested to keep the language uniform as PA as this is ultimately what the guest driver is supplying during vq creation and in posting buffers as physical address.


Yes, the spec calls the address accessed by the device "physical
address". Granted, this is pointless - there is only one
type of address the device can access. We can, if we want to,
replace that with just "address" or "memory address".
I don't think this ever caused confusion though, nor is it
worth the churn.



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-10-31  7:45       ` [virtio-comment] " Michael S. Tsirkin
@ 2023-10-31  9:32         ` Zhu, Lingshan
  2023-10-31  9:41           ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-10-31  9:32 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/31/2023 3:45 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 31, 2023 at 03:27:12AM +0000, Parav Pandit wrote:
>>
>>> From: Jason Wang <jasowang@redhat.com>
>>> Sent: Tuesday, October 31, 2023 7:13 AM
>>>
>>> On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
>>>> During a device migration flow (typically in a precopy phase of the
>>>> live migration), a device may write to the guest memory. Some
>>>> iommu/hypervisor may not be able to track these written pages.
>>>> These pages to be migrated from source to destination hypervisor.
>>>>
>>>> A device which writes to these pages, provides the page address record
>>>> of the to the owner device. The owner device starts write recording
>>>> for the device and queries all the page addresses written by the
>>>> device.
>>>>
>>>> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
>>>> Signed-off-by: Parav Pandit <parav@nvidia.com>
>>>> Signed-off-by: Satananda Burla <sburla@marvell.com>
>>>> ---
>>>> changelog:
>>>> v1->v2:
>>>> - addressed comments from Michael
>>>> - replaced iova with physical address
>>>> ---
>>>>   admin-cmds-device-migration.tex | 15 +++++++++++++++
>>>>   1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/admin-cmds-device-migration.tex
>>>> b/admin-cmds-device-migration.tex index ed911e4..2e32f2c 100644
>>>> --- a/admin-cmds-device-migration.tex
>>>> +++ b/admin-cmds-device-migration.tex
>>>> @@ -95,6 +95,21 @@ \subsubsection{Device Migration}\label{sec:Basic
>>>> Facilities of a Virtio Device /  The owner driver can discard any
>>>> partially read or written device context when  any of the device migration flow
>>> should be aborted.
>>>> +During the device migration flow, a passthrough device may write data
>>>> +to the guest virtual machine's memory, a source hypervisor needs to
>>>> +keep track of these written memory to migrate such memory to destination
>>> hypervisor.
>>>> +Some systems may not be able to keep track of such memory write
>>>> +addresses at hypervisor level. In such a scenario, a device records
>>>> +and reports these written memory addresses to the owner device. The
>>>> +owner driver enables write recording for one or more physical address
>>>> +ranges per device during device migration flow. The owner driver
>>>> +periodically queries these written physical address records from the device.
>>> I wonder how PA works in this case. Device uses untranslated requests so it can
>>> only see IOVA. We can't mandate ATS anyhow.
>> Michael suggested to keep the language uniform as PA as this is ultimately what the guest driver is supplying during vq creation and in posting buffers as physical address.
>
> Yes the spec calls the address accessed by the device "physical
> address". Granted, this is pointless - there is only one
> type of address device can access. We can if we want to
> replace that with just "address" or "memory address".
> I don't think this ever caused confusion though and worth the
> churn.
But you know PA means disable IOMMU and always enable ATS on both the 
device and host
>



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-10-31  9:32         ` Zhu, Lingshan
@ 2023-10-31  9:41           ` Michael S. Tsirkin
  2023-10-31  9:47             ` Zhu, Lingshan
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-10-31  9:41 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Oct 31, 2023 at 05:32:01PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/31/2023 3:45 PM, Michael S. Tsirkin wrote:
> > On Tue, Oct 31, 2023 at 03:27:12AM +0000, Parav Pandit wrote:
> > > 
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > 
> > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > During a device migration flow (typically in a precopy phase of the
> > > > > live migration), a device may write to the guest memory. Some
> > > > > iommu/hypervisor may not be able to track these written pages.
> > > > > These pages to be migrated from source to destination hypervisor.
> > > > > 
> > > > > A device which writes to these pages, provides the page address record
> > > > > of the to the owner device. The owner device starts write recording
> > > > > for the device and queries all the page addresses written by the
> > > > > device.
> > > > > 
> > > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > > ---
> > > > > changelog:
> > > > > v1->v2:
> > > > > - addressed comments from Michael
> > > > > - replaced iova with physical address
> > > > > ---
> > > > >   admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > > >   1 file changed, 15 insertions(+)
> > > > > 
> > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c 100644
> > > > > --- a/admin-cmds-device-migration.tex
> > > > > +++ b/admin-cmds-device-migration.tex
> > > > > @@ -95,6 +95,21 @@ \subsubsection{Device Migration}\label{sec:Basic
> > > > > Facilities of a Virtio Device /  The owner driver can discard any
> > > > > partially read or written device context when  any of the device migration flow
> > > > should be aborted.
> > > > > +During the device migration flow, a passthrough device may write data
> > > > > +to the guest virtual machine's memory, a source hypervisor needs to
> > > > > +keep track of these written memory to migrate such memory to destination
> > > > hypervisor.
> > > > > +Some systems may not be able to keep track of such memory write
> > > > > +addresses at hypervisor level. In such a scenario, a device records
> > > > > +and reports these written memory addresses to the owner device. The
> > > > > +owner driver enables write recording for one or more physical address
> > > > > +ranges per device during device migration flow. The owner driver
> > > > > +periodically queries these written physical address records from the device.
> > > > I wonder how PA works in this case. Device uses untranslated requests so it can
> > > > only see IOVA. We can't mandate ATS anyhow.
> > > Michael suggested to keep the language uniform as PA as this is ultimately what the guest driver is supplying during vq creation and in posting buffers as physical address.
> > 
> > Yes the spec calls the address accessed by the device "physical
> > address". Granted, this is pointless - there is only one
> > type of address device can access. We can if we want to
> > replace that with just "address" or "memory address".
> > I don't think this ever caused confusion though and worth the
> > churn.
> But you know PA means disable IOMMU and always enable ATS on both the device
> and host

That is not how virtio uses the term.  Just grep "physical address" in the spec.
This patch should be consistent.

-- 
MST



* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-10-31  9:41           ` Michael S. Tsirkin
@ 2023-10-31  9:47             ` Zhu, Lingshan
  0 siblings, 0 replies; 157+ messages in thread
From: Zhu, Lingshan @ 2023-10-31  9:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/31/2023 5:41 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 31, 2023 at 05:32:01PM +0800, Zhu, Lingshan wrote:
>>
>> On 10/31/2023 3:45 PM, Michael S. Tsirkin wrote:
>>> On Tue, Oct 31, 2023 at 03:27:12AM +0000, Parav Pandit wrote:
>>>>> From: Jason Wang <jasowang@redhat.com>
>>>>> Sent: Tuesday, October 31, 2023 7:13 AM
>>>>>
>>>>> On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
>>>>>> During a device migration flow (typically in a precopy phase of the
>>>>>> live migration), a device may write to the guest memory. Some
>>>>>> iommu/hypervisor may not be able to track these written pages.
>>>>>> These pages to be migrated from source to destination hypervisor.
>>>>>>
>>>>>> A device which writes to these pages, provides the page address record
>>>>>> of the to the owner device. The owner device starts write recording
>>>>>> for the device and queries all the page addresses written by the
>>>>>> device.
>>>>>>
>>>>>> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
>>>>>> Signed-off-by: Parav Pandit <parav@nvidia.com>
>>>>>> Signed-off-by: Satananda Burla <sburla@marvell.com>
>>>>>> ---
>>>>>> changelog:
>>>>>> v1->v2:
>>>>>> - addressed comments from Michael
>>>>>> - replaced iova with physical address
>>>>>> ---
>>>>>>    admin-cmds-device-migration.tex | 15 +++++++++++++++
>>>>>>    1 file changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/admin-cmds-device-migration.tex
>>>>>> b/admin-cmds-device-migration.tex index ed911e4..2e32f2c 100644
>>>>>> --- a/admin-cmds-device-migration.tex
>>>>>> +++ b/admin-cmds-device-migration.tex
>>>>>> @@ -95,6 +95,21 @@ \subsubsection{Device Migration}\label{sec:Basic
>>>>>> Facilities of a Virtio Device /  The owner driver can discard any
>>>>>> partially read or written device context when  any of the device migration flow
>>>>> should be aborted.
>>>>>> +During the device migration flow, a passthrough device may write data
>>>>>> +to the guest virtual machine's memory, a source hypervisor needs to
>>>>>> +keep track of these written memory to migrate such memory to destination
>>>>> hypervisor.
>>>>>> +Some systems may not be able to keep track of such memory write
>>>>>> +addresses at hypervisor level. In such a scenario, a device records
>>>>>> +and reports these written memory addresses to the owner device. The
>>>>>> +owner driver enables write recording for one or more physical address
>>>>>> +ranges per device during device migration flow. The owner driver
>>>>>> +periodically queries these written physical address records from the device.
>>>>> I wonder how PA works in this case. Device uses untranslated requests so it can
>>>>> only see IOVA. We can't mandate ATS anyhow.
>>>> Michael suggested to keep the language uniform as PA as this is ultimately what the guest driver is supplying during vq creation and in posting buffers as physical address.
>>> Yes the spec calls the address accessed by the device "physical
>>> address". Granted, this is pointless - there is only one
>>> type of address device can access. We can if we want to
>>> replace that with just "address" or "memory address".
>>> I don't think this ever caused confusion though and worth the
>>> churn.
>> But you know PA means disable IOMMU and always enable ATS on both the device
>> and host
> That is not how virtio uses the term.  Just grep "physical address" in the spec.
> This patch should be consistent.
OK, I use other terms in my series; let's see whether that is better.
>



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-10-31  3:27     ` [virtio-comment] " Parav Pandit
  2023-10-31  7:45       ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-01  0:29       ` Jason Wang
  2023-11-01  3:02         ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-01  0:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 31, 2023 7:13 AM
> >
> > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > During a device migration flow (typically in a precopy phase of the
> > > live migration), a device may write to the guest memory. Some
> > > iommu/hypervisor may not be able to track these written pages.
> > > These pages to be migrated from source to destination hypervisor.
> > >
> > > A device which writes to these pages, provides the page address record
> > > of the to the owner device. The owner device starts write recording
> > > for the device and queries all the page addresses written by the
> > > device.
> > >
> > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > ---
> > > changelog:
> > > v1->v2:
> > > - addressed comments from Michael
> > > - replaced iova with physical address
> > > ---
> > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > >  1 file changed, 15 insertions(+)
> > >
> > > diff --git a/admin-cmds-device-migration.tex
> > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c 100644
> > > --- a/admin-cmds-device-migration.tex
> > > +++ b/admin-cmds-device-migration.tex
> > > @@ -95,6 +95,21 @@ \subsubsection{Device Migration}\label{sec:Basic
> > > Facilities of a Virtio Device /  The owner driver can discard any
> > > partially read or written device context when  any of the device migration flow
> > should be aborted.
> > >
> > > +During the device migration flow, a passthrough device may write data
> > > +to the guest virtual machine's memory, a source hypervisor needs to
> > > +keep track of these written memory to migrate such memory to destination
> > hypervisor.
> > > +Some systems may not be able to keep track of such memory write
> > > +addresses at hypervisor level. In such a scenario, a device records
> > > +and reports these written memory addresses to the owner device. The
> > > +owner driver enables write recording for one or more physical address
> > > +ranges per device during device migration flow. The owner driver
> > > +periodically queries these written physical address records from the device.
> >
> > I wonder how PA works in this case. Device uses untranslated requests so it can
> > only see IOVA. We can't mandate ATS anyhow.
> Michael suggested to keep the language uniform as PA as this is ultimately what the guest driver is supplying during vq creation and in posting buffers as physical address.

This seems to need some work. And, can you show me how it can work?

1) e.g if GAW is 48 bit, is the hypervisor expected to do a bisection
of the whole range?
2) does the device need to reserve sufficient internal resources for
logging the dirty page and why (not)?
3) DMA is part of the transport, it's natural to do logging there, why
duplicate efforts in the virtio layer? I can't see how it can compete
with the functionality that is provided by the platform. And what's
more, we can't assume virtio is the only device that is used by the
guest.

Thanks



* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-01  0:29       ` Jason Wang
@ 2023-11-01  3:02         ` Parav Pandit
  2023-11-02  4:24           ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-01  3:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 1, 2023 6:00 AM
> 
> On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, October 31, 2023 7:13 AM
> > >
> > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > > During a device migration flow (typically in a precopy phase of
> > > > the live migration), a device may write to the guest memory. Some
> > > > iommu/hypervisor may not be able to track these written pages.
> > > > These pages to be migrated from source to destination hypervisor.
> > > >
> > > > A device which writes to these pages, provides the page address
> > > > record of the to the owner device. The owner device starts write
> > > > recording for the device and queries all the page addresses
> > > > written by the device.
> > > >
> > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > ---
> > > > changelog:
> > > > v1->v2:
> > > > - addressed comments from Michael
> > > > - replaced iova with physical address
> > > > ---
> > > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > >  1 file changed, 15 insertions(+)
> > > >
> > > > diff --git a/admin-cmds-device-migration.tex
> > > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c 100644
> > > > --- a/admin-cmds-device-migration.tex
> > > > +++ b/admin-cmds-device-migration.tex
> > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > Migration}\label{sec:Basic Facilities of a Virtio Device /  The
> > > > owner driver can discard any partially read or written device
> > > > context when  any of the device migration flow
> > > should be aborted.
> > > >
> > > > +During the device migration flow, a passthrough device may write
> > > > +data to the guest virtual machine's memory, a source hypervisor
> > > > +needs to keep track of these written memory to migrate such
> > > > +memory to destination
> > > hypervisor.
> > > > +Some systems may not be able to keep track of such memory write
> > > > +addresses at hypervisor level. In such a scenario, a device
> > > > +records and reports these written memory addresses to the owner
> > > > +device. The owner driver enables write recording for one or more
> > > > +physical address ranges per device during device migration flow.
> > > > +The owner driver periodically queries these written physical address
> records from the device.
> > >
> > > I wonder how PA works in this case. Device uses untranslated
> > > requests so it can only see IOVA. We can't mandate ATS anyhow.
> > Michael suggested to keep the language uniform as PA as this is ultimately
> what the guest driver is supplying during vq creation and in posting buffers as
> physical address.
> 
> This seems to need some work. And, can you show me how it can work?
> 
> 1) e.g if GAW is 48 bit, is the hypervisor expected to do a bisection of the whole
> range?
> 2) does the device need to reserve sufficient internal resources for logging the
> dirty page and why (not)?
No. The device reserves enough resources only when dirty page logging starts.

> 3) DMA is part of the transport, it's natural to do logging there, why duplicate
> efforts in the virtio layer? 
He he, that is a funny comment.
When an abstract facility is added to virtio, you say to do it in the transport.
When one does something in the transport, you say this is transport specific, do something generic.

Here the device being tracked is a virtio device.
PCI-SIG has already said that the PCIM interface is outside its scope.
Hence, this is done here in the virtio layer in an abstract way.

> I can't see how it can compete with the functionality
> that is provided by the platform. And what's more, we can't assume virtio is the
> only device that is used by the guest.
> 
You raised this before and it was answered.
Not all platforms support dirty page tracking effectively.
This is an optional facility that reduces the migration downtime significantly.
So until the platform supports it, it is supported by virtio.

> Thanks



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-01  3:02         ` [virtio-comment] " Parav Pandit
@ 2023-11-02  4:24           ` Jason Wang
  2023-11-02  6:10             ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-02  4:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 1, 2023 6:00 AM
> >
> > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > >
> > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > > During a device migration flow (typically in a precopy phase of
> > > > > the live migration), a device may write to the guest memory. Some
> > > > > iommu/hypervisor may not be able to track these written pages.
> > > > > These pages to be migrated from source to destination hypervisor.
> > > > >
> > > > > A device which writes to these pages, provides the page address
> > > > > record of the to the owner device. The owner device starts write
> > > > > recording for the device and queries all the page addresses
> > > > > written by the device.
> > > > >
> > > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > > ---
> > > > > changelog:
> > > > > v1->v2:
> > > > > - addressed comments from Michael
> > > > > - replaced iova with physical address
> > > > > ---
> > > > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > > >  1 file changed, 15 insertions(+)
> > > > >
> > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c 100644
> > > > > --- a/admin-cmds-device-migration.tex
> > > > > +++ b/admin-cmds-device-migration.tex
> > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /  The
> > > > > owner driver can discard any partially read or written device
> > > > > context when  any of the device migration flow
> > > > should be aborted.
> > > > >
> > > > > +During the device migration flow, a passthrough device may write
> > > > > +data to the guest virtual machine's memory, a source hypervisor
> > > > > +needs to keep track of these written memory to migrate such
> > > > > +memory to destination
> > > > hypervisor.
> > > > > +Some systems may not be able to keep track of such memory write
> > > > > +addresses at hypervisor level. In such a scenario, a device
> > > > > +records and reports these written memory addresses to the owner
> > > > > +device. The owner driver enables write recording for one or more
> > > > > +physical address ranges per device during device migration flow.
> > > > > +The owner driver periodically queries these written physical address
> > records from the device.
> > > >
> > > > I wonder how PA works in this case. Device uses untranslated
> > > > requests so it can only see IOVA. We can't mandate ATS anyhow.
> > > Michael suggested to keep the language uniform as PA as this is ultimately
> > what the guest driver is supplying during vq creation and in posting buffers as
> > physical address.
> >
> > This seems to need some work. And, can you show me how it can work?
> >
> > 1) e.g if GAW is 48 bit, is the hypervisor expected to do a bisection of the whole
> > range?
> > 2) does the device need to reserve sufficient internal resources for logging the
> > dirty page and why (not)?
> No when dirty page logging starts, only at that time, device will reserve enough resources.

If GAW is 48 bits, how large would the reserved resources have to be then?
What happens if we're trying to migrate more than one device?

>
> > 3) DMA is part of the transport, it's natural to do logging there, why duplicate
> > efforts in the virtio layer?
> He he, you have funny comment.
> When an abstract facility is added to virtio you say to do in transport.

So it's not done in the general facility but tied to the admin part.
And we all know dirty page tracking is a challenge, and Eugenio has a
good summary of the pros/cons. A revisit of those docs makes me think
virtio is not a good place for doing that, for many reasons:

1) as stated, the platform will evolve to be able to track dirty pages;
actually, it is already supported by a lot of major IOMMU vendors
2) you can't assume virtio is the only device that can be used by the
guest; having dirty page tracking implemented in each type of device
is unrealistic
3) inventing it in the virtio layer will be deprecated in the future
for sure, as the platform will provide much richer features for logging,
e.g. it can do it per PASID etc.; I don't see any reason virtio needs
to compete with the features that will be provided by the platform
4) if the platform support is missing, we can use software or leverage
the transport for assistance, like PRI

> When one does something in transport, you say, this is transport specific, do some generic.
>
> Here the device is being tracked is virtio device.
> PCI-SIG has told already that PCIM interface is outside the scope of it.
> Hence, this is done in virtio layer here in abstract way.

You will end up in a competition with the platform/transport one,
and that will fail.

>
> > I can't see how it can compete with the functionality
> > that is provided by the platform. And what's more, we can't assume virtio is the
> > only device that is used by the guest.
> >
> You raised this before and it was answered.
> Not all platform support dirty page tracking effectively.
> This is optional facility that speed up the migration down time significantly.

I can hardly believe the downtime is determined by the speed of
logging dirty pages...

> So until platform supports it, it is supported by virtio.

Some platforms already support that.

Thanks

>
> > Thanks
>



* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-02  4:24           ` [virtio-comment] " Jason Wang
@ 2023-11-02  6:10             ` Parav Pandit
  2023-11-06  6:34               ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-02  6:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 2, 2023 9:54 AM
> 
> On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, November 1, 2023 6:00 AM
> > >
> > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > >
> > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > > During a device migration flow (typically in a precopy phase
> > > > > > of the live migration), a device may write to the guest
> > > > > > memory. Some iommu/hypervisor may not be able to track these
> written pages.
> > > > > > These pages to be migrated from source to destination hypervisor.
> > > > > >
> > > > > > A device which writes to these pages, provides the page
> > > > > > address record of the to the owner device. The owner device
> > > > > > starts write recording for the device and queries all the page
> > > > > > addresses written by the device.
> > > > > >
> > > > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > > > ---
> > > > > > changelog:
> > > > > > v1->v2:
> > > > > > - addressed comments from Michael
> > > > > > - replaced iova with physical address
> > > > > > ---
> > > > > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > > > >  1 file changed, 15 insertions(+)
> > > > > >
> > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c
> > > > > > 100644
> > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > The owner driver can discard any partially read or written
> > > > > > device context when  any of the device migration flow
> > > > > should be aborted.
> > > > > >
> > > > > > +During the device migration flow, a passthrough device may
> > > > > > +write data to the guest virtual machine's memory, a source
> > > > > > +hypervisor needs to keep track of these written memory to
> > > > > > +migrate such memory to destination
> > > > > hypervisor.
> > > > > > +Some systems may not be able to keep track of such memory
> > > > > > +write addresses at hypervisor level. In such a scenario, a
> > > > > > +device records and reports these written memory addresses to
> > > > > > +the owner device. The owner driver enables write recording
> > > > > > +for one or more physical address ranges per device during device
> migration flow.
> > > > > > +The owner driver periodically queries these written physical
> > > > > > +address
> > > records from the device.
> > > > >
> > > > > I wonder how PA works in this case. Device uses untranslated
> > > > > requests so it can only see IOVA. We can't mandate ATS anyhow.
> > > > Michael suggested to keep the language uniform as PA as this is
> > > > ultimately
> > > what the guest driver is supplying during vq creation and in posting
> > > buffers as physical address.
> > >
> > > This seems to need some work. And, can you show me how it can work?
> > >
> > > 1) e.g if GAW is 48 bit, is the hypervisor expected to do a
> > > bisection of the whole range?
> > > 2) does the device need to reserve sufficient internal resources for
> > > logging the dirty page and why (not)?
> > No when dirty page logging starts, only at that time, device will reserve
> enough resources.
> 
> GAW is 48bit, how large would it have then? 
Dirty page tracking is not dependent on the size of the GAW.
It is a function of the address ranges covering the guest memory, regardless of GAW.

> What happens if we're trying to migrate more than 1 device?
> 
That is perfectly fine.
Each device updates its own log of the pages it wrote.
The hypervisor collects their sum.
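
As a rough illustration of that sum (a minimal sketch with made-up names;
nothing here is defined by the spec), the hypervisor side could fold each
device's reported page addresses into the single dirty bitmap it already
keeps for the guest:

/* Hypothetical helper: merge one device's reported written-page
 * addresses into the hypervisor's per-guest dirty bitmap.
 * written_pa[] holds guest physical addresses as reported by the
 * device; dirty_bitmap has one bit per 4KiB guest page. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12

static void merge_device_log(uint8_t *dirty_bitmap,
                             const uint64_t *written_pa, size_t count)
{
        for (size_t i = 0; i < count; i++) {
                uint64_t pfn = written_pa[i] >> PAGE_SHIFT;

                dirty_bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
        }
}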

> >
> > > 3) DMA is part of the transport, it's natural to do logging there,
> > > why duplicate efforts in the virtio layer?
> > He he, you have funny comment.
> > When an abstract facility is added to virtio you say to do in transport.
> 
> So it's not done in the general facility but tied to the admin part.
> And we all know dirty page tracking is a challenge and Eugenio has a good
> summary of pros/cons. A revisit of those docs make me think virtio is not the
> good place for doing that for may reasons:
> 
> 1) as stated, platform will evolve to be able to tracking dirty pages, actually, it
> has been supported by a lot of major IOMMU vendors

This is an optional facility in virtio.
Can you please point to the references? I don't see it in the common Linux kernel support.
Instead, the Linux kernel chose to extend this to the devices.
At least it is not seen arriving any time near the start of 2024, which is when users must use this.

> 2) you can't assume virtio is the only device that can be used by the guest,
> having dirty pages tracking to be implemented in each type of device is
> unrealistic
Of course, there is no such assumption made. Where did you see text that made such an assumption?
Each virtio and non-virtio device that wants to report its dirty pages will do so its own way.

> 3) inventing it in the virtio layer will be deprecated in the future for sure, as
> platform will provide much rich features for logging e.g it can do it per PASID
> etc, I don't see any reason virtio need to compete with the features that will be
> provided by the platform
Can you bring the CPU vendors and their commitment to the virtio TC with timelines, so that the virtio TC can omit this?
I.e., in the first year of 2024?
If not, we are better off offering this, and when/if platform support arrives, sure, this feature can be disabled/not used/not enabled.

> 4) if the platform support is missing, we can use software or leverage transport
> for assistance like PRI
All of these are in theory.
Our experiment shows PRI performance is 21x slower than the page fault rate handled by the CPU.
It simply does not even pass a simple 10Gbps test.
There is no requirement mandating PRI either.
So it is unusable.
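
For scale (a back-of-the-envelope number, not from the experiment itself):
sustaining a 10Gbps write stream at 4KiB granularity already means on the
order of 300,000 faulted pages per second, which is what a PRI-based
approach would have to keep up with:

#include <stdio.h>

int main(void)
{
        double link_bytes_per_sec = 10.0e9 / 8.0;  /* 10 Gbps in bytes/s */
        double page_size = 4096.0;                 /* 4KiB pages */

        /* pages written (and therefore faults needed) per second */
        printf("page faults/s to sustain 10Gbps: %.0f\n",
               link_bytes_per_sec / page_size);    /* ~305,000 */
        return 0;
}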

> 
> > When one does something in transport, you say, this is transport specific, do
> some generic.
> >
> > Here the device is being tracked is virtio device.
> > PCI-SIG has told already that PCIM interface is outside the scope of it.
> > Hence, this is done in virtio layer here in abstract way.
> 
> You will end up with a competition with the platform/transport one that will
> fail.
> 
I don't see a reason. There is no competition.
The platform always has a choice not to use device-side page tracking when it is supported.

> >
> > > I can't see how it can compete with the functionality that is
> > > provided by the platform. And what's more, we can't assume virtio is
> > > the only device that is used by the guest.
> > >
> > You raised this before and it was answered.
> > Not all platform support dirty page tracking effectively.
> > This is optional facility that speed up the migration down time significantly.
> 
> I can hardly believe the downtime is determined by the speed of logging dirty
> pages...
> 
Without dirty page tracking, in pre-copy, all the memory must be migrated again, which takes significantly longer.

> > So until platform supports it, it is supported by virtio.
> 
> Some platforms already support that.
> 
And some do not.
Hence, whoever wants to use the platform will use the platform.
Whoever prefers to use the device will use the device.


* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-02  6:10             ` [virtio-comment] " Parav Pandit
@ 2023-11-06  6:34               ` Jason Wang
  2023-11-06  6:53                 ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-06  6:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, November 2, 2023 9:54 AM
> >
> > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > >
> > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > >
> > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > > During a device migration flow (typically in a precopy phase
> > > > > > > of the live migration), a device may write to the guest
> > > > > > > memory. Some iommu/hypervisor may not be able to track these
> > written pages.
> > > > > > > These pages to be migrated from source to destination hypervisor.
> > > > > > >
> > > > > > > A device which writes to these pages, provides the page
> > > > > > > address record of the to the owner device. The owner device
> > > > > > > starts write recording for the device and queries all the page
> > > > > > > addresses written by the device.
> > > > > > >
> > > > > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > > > > ---
> > > > > > > changelog:
> > > > > > > v1->v2:
> > > > > > > - addressed comments from Michael
> > > > > > > - replaced iova with physical address
> > > > > > > ---
> > > > > > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > > > > >  1 file changed, 15 insertions(+)
> > > > > > >
> > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c
> > > > > > > 100644
> > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > > The owner driver can discard any partially read or written
> > > > > > > device context when  any of the device migration flow
> > > > > > should be aborted.
> > > > > > >
> > > > > > > +During the device migration flow, a passthrough device may
> > > > > > > +write data to the guest virtual machine's memory, a source
> > > > > > > +hypervisor needs to keep track of these written memory to
> > > > > > > +migrate such memory to destination
> > > > > > hypervisor.
> > > > > > > +Some systems may not be able to keep track of such memory
> > > > > > > +write addresses at hypervisor level. In such a scenario, a
> > > > > > > +device records and reports these written memory addresses to
> > > > > > > +the owner device. The owner driver enables write recording
> > > > > > > +for one or more physical address ranges per device during device
> > migration flow.
> > > > > > > +The owner driver periodically queries these written physical
> > > > > > > +address
> > > > records from the device.
> > > > > >
> > > > > > I wonder how PA works in this case. Device uses untranslated
> > > > > > requests so it can only see IOVA. We can't mandate ATS anyhow.
> > > > > Michael suggested to keep the language uniform as PA as this is
> > > > > ultimately
> > > > what the guest driver is supplying during vq creation and in posting
> > > > buffers as physical address.
> > > >
> > > > This seems to need some work. And, can you show me how it can work?
> > > >
> > > > 1) e.g if GAW is 48 bit, is the hypervisor expected to do a
> > > > bisection of the whole range?
> > > > 2) does the device need to reserve sufficient internal resources for
> > > > logging the dirty page and why (not)?
> > > No when dirty page logging starts, only at that time, device will reserve
> > enough resources.
> >
> > GAW is 48bit, how large would it have then?
> Dirty page tracking is not dependent on the size of the GAW.
> It is function of address ranges for the amount of guest memory regardless of GAW.

The problem is, e.g. when vIOMMU is enabled, you can't know which IOVA
is actually used by the guest. And even for the case when vIOMMU is not
enabled, the guest may have several TBs of memory. Is it easy for the
device itself to reserve sufficient resources?

The host should always have more resources than the device; in that sense
there could be several methods that try to utilize host memory
instead of the memory in the device. I think we discussed this when
going through the doc prepared by Eugenio.

>
> > What happens if we're trying to migrate more than 1 device?
> >
> That is perfectly fine.
> Each device is updating its log of pages it wrote.
> The hypervisor is collecting their sum.

See above.

>
> > >
> > > > 3) DMA is part of the transport, it's natural to do logging there,
> > > > why duplicate efforts in the virtio layer?
> > > He he, you have funny comment.
> > > When an abstract facility is added to virtio you say to do in transport.
> >
> > So it's not done in the general facility but tied to the admin part.
> > And we all know dirty page tracking is a challenge and Eugenio has a good
> > summary of pros/cons. A revisit of those docs make me think virtio is not the
> > good place for doing that for may reasons:
> >
> > 1) as stated, platform will evolve to be able to tracking dirty pages, actually, it
> > has been supported by a lot of major IOMMU vendors
>
> This is optional facility in virtio.
> Can you please point to the references? I don’t see it in the common Linux kernel support for it.

Note that when IOMMUFD was being proposed, dirty page tracking was one
of the major considerations.

This is one recent proposal:

https://www.spinics.net/lists/kvm/msg330894.html

> Instead Linux kernel choose to extend to the devices.

Well, as I stated, tracking dirty pages is challenging if you want to
do it on a device, and you can't simply invent dirty page tracking for
each type of device.

> At least not seen to arrive this in any near term in start of 2024 which is where users must use this.
>
> > 2) you can't assume virtio is the only device that can be used by the guest,
> > having dirty pages tracking to be implemented in each type of device is
> > unrealistic
> Of course, there is no such assumption made. Where did you see a text that made such assumption?

So what happens if you have a guest with virtio and other devices assigned?

> Each virtio and non virtio devices who wants to report their dirty page report, will do their way.
>
> > 3) inventing it in the virtio layer will be deprecated in the future for sure, as
> > platform will provide much rich features for logging e.g it can do it per PASID
> > etc, I don't see any reason virtio need to compete with the features that will be
> > provided by the platform
> Can you bring the cpu vendors and committement to virtio tc with timelines so that virtio TC can omit?

Why do we need to bring CPU vendors into the virtio TC? Virtio needs to
be built on top of the transport or the platform. There's no need to
duplicate their job, especially considering that virtio can't do better
than them.

> i.e. in first year of 2024?

Why does it matter in 2024?

> If not, we are better off to offer this, and when/if platform support is, sure, this feature can be disabled/not used/not enabled.
>
> > 4) if the platform support is missing, we can use software or leverage transport
> > for assistance like PRI
> All of these are in theory.
> Our experiment shows PRI performance is 21x slower than page fault rate done by the cpu.
> It simply does not even pass a simple 10Gbps test.

If you stick to the wire speed during migration, it can converge.

> There is no requirement for mandating PRI either.
> So it is unusable.

It's not about mandating, it's about doing things in the correct
layer. If PRI is slow, PCI can evolve for sure.

>
> >
> > > When one does something in transport, you say, this is transport specific, do
> > some generic.
> > >
> > > Here the device is being tracked is virtio device.
> > > PCI-SIG has told already that PCIM interface is outside the scope of it.
> > > Hence, this is done in virtio layer here in abstract way.
> >
> > You will end up with a competition with the platform/transport one that will
> > fail.
> >
> I don’t see a reason. There is no competition.
> Platform always have a choice to not use device side page tracking when it is supported.

The platform provides a lot of other functionality for dirty logging:
e.g. per PASID, granular tracking, etc. So do you want to duplicate them
again in virtio? If not, why choose this way?

>
> > >
> > > > I can't see how it can compete with the functionality that is
> > > > provided by the platform. And what's more, we can't assume virtio is
> > > > the only device that is used by the guest.
> > > >
> > > You raised this before and it was answered.
> > > Not all platform support dirty page tracking effectively.
> > > This is optional facility that speed up the migration down time significantly.
> >
> > I can hardly believe the downtime is determined by the speed of logging dirty
> > pages...
> >
> Without dirty page tracking, in pre-copy, all the memory must be migrated again, which takes significantly longer time.

It's about w/ and w/o dirty page tracking, it's not about the speed of
dirty page tracking.

>
> > > So until platform supports it, it is supported by virtio.
> >
> > Some platforms already support that.
> >
> And some done.
> Hence whichever wants to use by platform, will use by platform.
> Whichever prefer to use from device will use by the device.

I don't think so; for example, virtio has a hard time when it doesn't
rely on the platform (e.g. the IOMMU). We don't want to repeat that
tragedy.

Thanks



* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-06  6:34               ` [virtio-comment] " Jason Wang
@ 2023-11-06  6:53                 ` Parav Pandit
  2023-11-07  4:04                   ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-06  6:53 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 6, 2023 12:04 PM
> 
> On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, November 2, 2023 9:54 AM
> > >
> > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > >
> > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > >
> > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > > During a device migration flow (typically in a precopy
> > > > > > > > phase of the live migration), a device may write to the
> > > > > > > > guest memory. Some iommu/hypervisor may not be able to
> > > > > > > > track these
> > > written pages.
> > > > > > > > These pages to be migrated from source to destination hypervisor.
> > > > > > > >
> > > > > > > > A device which writes to these pages, provides the page
> > > > > > > > address record of the to the owner device. The owner
> > > > > > > > device starts write recording for the device and queries
> > > > > > > > all the page addresses written by the device.
> > > > > > > >
> > > > > > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > > > > > ---
> > > > > > > > changelog:
> > > > > > > > v1->v2:
> > > > > > > > - addressed comments from Michael
> > > > > > > > - replaced iova with physical address
> > > > > > > > ---
> > > > > > > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c
> > > > > > > > 100644
> > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > > > The owner driver can discard any partially read or written
> > > > > > > > device context when  any of the device migration flow
> > > > > > > should be aborted.
> > > > > > > >
> > > > > > > > +During the device migration flow, a passthrough device
> > > > > > > > +may write data to the guest virtual machine's memory, a
> > > > > > > > +source hypervisor needs to keep track of these written
> > > > > > > > +memory to migrate such memory to destination
> > > > > > > hypervisor.
> > > > > > > > +Some systems may not be able to keep track of such memory
> > > > > > > > +write addresses at hypervisor level. In such a scenario,
> > > > > > > > +a device records and reports these written memory
> > > > > > > > +addresses to the owner device. The owner driver enables
> > > > > > > > +write recording for one or more physical address ranges
> > > > > > > > +per device during device
> > > migration flow.
> > > > > > > > +The owner driver periodically queries these written
> > > > > > > > +physical address
> > > > > records from the device.
> > > > > > >
> > > > > > > I wonder how PA works in this case. Device uses untranslated
> > > > > > > requests so it can only see IOVA. We can't mandate ATS anyhow.
> > > > > > Michael suggested to keep the language uniform as PA as this
> > > > > > is ultimately
> > > > > what the guest driver is supplying during vq creation and in
> > > > > posting buffers as physical address.
> > > > >
> > > > > This seems to need some work. And, can you show me how it can work?
> > > > >
> > > > > 1) e.g if GAW is 48 bit, is the hypervisor expected to do a
> > > > > bisection of the whole range?
> > > > > 2) does the device need to reserve sufficient internal resources
> > > > > for logging the dirty page and why (not)?
> > > > No when dirty page logging starts, only at that time, device will
> > > > reserve
> > > enough resources.
> > >
> > > GAW is 48bit, how large would it have then?
> > Dirty page tracking is not dependent on the size of the GAW.
> > It is function of address ranges for the amount of guest memory regardless of
> GAW.
> 
> The problem is, e.g when vIOMMU is enabled, you can't know which IOVA is
> actually used by guests. And even for the case when vIOMMU is not enabled,
> the guest may have several TBs. Is it easy to reserve sufficient resources by the
> device itself?
> 
When page tracking is enabled per device, the device knows about the range and can reserve the necessary resources.
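
As a sizing illustration (an assumption about one possible bitmap-based
implementation, not something this series mandates), the tracking state
grows only with the enabled ranges:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t range_size   = 1ULL << 40;  /* 1 TiB of tracked guest memory */
        uint64_t page_size    = 4096;        /* one bit per 4KiB page */
        uint64_t bitmap_bytes = range_size / page_size / 8;

        /* 32 MiB of tracking state per TiB tracked (32 KiB per GiB) */
        printf("bitmap size: %llu MiB\n",
               (unsigned long long)(bitmap_bytes >> 20));
        return 0;
}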

> Host should always have more resources than device, in that sense there could
> be several methods that tries to utilize host memory instead of the one in the
> device. I think we've discussed this when going through the doc prepared by
> Eugenio.
> 
> >
> > > What happens if we're trying to migrate more than 1 device?
> > >
> > That is perfectly fine.
> > Each device is updating its log of pages it wrote.
> > The hypervisor is collecting their sum.
> 
> See above.
> 
> >
> > > >
> > > > > 3) DMA is part of the transport, it's natural to do logging
> > > > > there, why duplicate efforts in the virtio layer?
> > > > He he, you have funny comment.
> > > > When an abstract facility is added to virtio you say to do in transport.
> > >
> > > So it's not done in the general facility but tied to the admin part.
> > > And we all know dirty page tracking is a challenge and Eugenio has a
> > > good summary of pros/cons. A revisit of those docs make me think
> > > virtio is not the good place for doing that for may reasons:
> > >
> > > 1) as stated, platform will evolve to be able to tracking dirty
> > > pages, actually, it has been supported by a lot of major IOMMU
> > > vendors
> >
> > This is optional facility in virtio.
> > Can you please point to the references? I don’t see it in the common Linux
> kernel support for it.
> 
> Note that when IOMMUFD is being proposed, dirty page tracking is one of the
> major considerations.
> 
> This is one recent proposal:
> 
> https://www.spinics.net/lists/kvm/msg330894.html
> 
Sure, so if the platform supports it, it can be used from the platform.
If it does not, the device supplies it.

> > Instead Linux kernel choose to extend to the devices.
> 
> Well, as I stated, tracking dirty pages is challenging if you want to do it on a
> device, and you can't simply invent dirty page tracking for each type of the
> devices.
> 
It is not invented.
It is a generic framework for all virtio device types, as proposed here.
Keep in mind that it is already optional in the v3 series.

> > At least not seen to arrive this in any near term in start of 2024 which is
> where users must use this.
> >
> > > 2) you can't assume virtio is the only device that can be used by
> > > the guest, having dirty pages tracking to be implemented in each
> > > type of device is unrealistic
> > Of course, there is no such assumption made. Where did you see a text that
> made such assumption?
> 
> So what happens if you have a guest with virtio and other devices assigned?
> 
What happens? Each device type would do its own dirty page tracking.
And if not all devices have support, the hypervisor knows to fall back to the platform IOMMU or its own tracking.
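
A hedged sketch of that fallback decision on the hypervisor side (the
helper and enum names are illustrative only, not existing APIs):

#include <stdbool.h>

enum dirty_backend {
        BACKEND_PLATFORM_IOMMU,  /* IOMMU-based dirty tracking */
        BACKEND_DEVICE,          /* per-device write recording (this series) */
        BACKEND_SOFTWARE,        /* e.g. resend-all / trap-based fallback */
};

static enum dirty_backend pick_dirty_backend(bool iommu_tracks_dirty,
                                             bool all_devices_track_dirty)
{
        if (iommu_tracks_dirty)
                return BACKEND_PLATFORM_IOMMU;
        if (all_devices_track_dirty)
                return BACKEND_DEVICE;
        return BACKEND_SOFTWARE;
}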

> > Each virtio and non virtio devices who wants to report their dirty page report,
> will do their way.
> >
> > > 3) inventing it in the virtio layer will be deprecated in the future
> > > for sure, as platform will provide much rich features for logging
> > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > to compete with the features that will be provided by the platform
> > Can you bring the cpu vendors and committement to virtio tc with timelines
> so that virtio TC can omit?
> 
> Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> on top of transport or platform. There's no need to duplicate their job.
> Especially considering that virtio can't do better than them.
> 
I wanted to see a strong commitment from the CPU vendors to support dirty page tracking.
And the work seems to have started for some platforms.
Without such a platform commitment, virtio also skipping it would not work.

> > i.e. in first year of 2024?
> 
> Why does it matter in 2024?
Because users need to use it now.

> 
> > If not, we are better off to offer this, and when/if platform support is, sure,
> this feature can be disabled/not used/not enabled.
> >
> > > 4) if the platform support is missing, we can use software or
> > > leverage transport for assistance like PRI
> > All of these are in theory.
> > Our experiment shows PRI performance is 21x slower than page fault rate
> done by the cpu.
> > It simply does not even pass a simple 10Gbps test.
> 
> If you stick to the wire speed during migration, it can converge.
Do you have perf data for this?
In the internal tests we don’t see this happening.

> 
> > There is no requirement for mandating PRI either.
> > So it is unusable.
> 
> It's not about mandating, it's about doing things in the correct layer. If PRI is
> slow, PCI can evolve for sure.
You should try.
In the current state, it is mandatory.
And if you think PRI is the only way, then you should propose, in the dirty page tracking series that you listed above, not to do dirty page tracking and rather depend on PRI, right?

> 
> >
> > >
> > > > When one does something in transport, you say, this is transport
> > > > specific, do
> > > some generic.
> > > >
> > > > Here the device is being tracked is virtio device.
> > > > PCI-SIG has told already that PCIM interface is outside the scope of it.
> > > > Hence, this is done in virtio layer here in abstract way.
> > >
> > > You will end up with a competition with the platform/transport one
> > > that will fail.
> > >
> > I don’t see a reason. There is no competition.
> > Platform always have a choice to not use device side page tracking when it is
> supported.
> 
> Platform provides a lot of other functionalities for dirty logging:
> e.g per PASID, granular, etc. So you want to duplicate them again in the virtio? If
> not, why choose this way?
> 
It is optional, for the platforms that do not have it.

> >
> > > >
> > > > > I can't see how it can compete with the functionality that is
> > > > > provided by the platform. And what's more, we can't assume
> > > > > virtio is the only device that is used by the guest.
> > > > >
> > > > You raised this before and it was answered.
> > > > Not all platform support dirty page tracking effectively.
> > > > This is optional facility that speed up the migration down time
> significantly.
> > >
> > > I can hardly believe the downtime is determined by the speed of
> > > logging dirty pages...
> > >
> > Without dirty page tracking, in pre-copy, all the memory must be migrated
> again, which takes significantly longer time.
> 
> It's about w/ and w/o dirty page tracking, it's not about the speed of dirty page
> tracking.
> 
> >
> > > > So until platform supports it, it is supported by virtio.
> > >
> > > Some platforms already support that.
> > >
> > And some done.
> > Hence whichever wants to use by platform, will use by platform.
> > Whichever prefer to use from device will use by the device.
> 
> I don't think so, for example virtio has a hard time when it doesn't rely on the
> platform (e.g IOMMU). We don't want to repeat that tragedy.
Too general a statement, which I don't think is applicable here.

But let's see the support from the platform.


* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-06  6:53                 ` [virtio-comment] " Parav Pandit
@ 2023-11-07  4:04                   ` Jason Wang
  2023-11-07  7:05                     ` Michael S. Tsirkin
  2023-11-09  6:24                     ` Parav Pandit
  0 siblings, 2 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-07  4:04 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 6, 2023 12:04 PM
> >
> > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > >
> > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > >
> > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > >
> > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > > During a device migration flow (typically in a precopy
> > > > > > > > > phase of the live migration), a device may write to the
> > > > > > > > > guest memory. Some iommu/hypervisor may not be able to
> > > > > > > > > track these
> > > > written pages.
> > > > > > > > > These pages to be migrated from source to destination hypervisor.
> > > > > > > > >
> > > > > > > > > A device which writes to these pages, provides the page
> > > > > > > > > address record of the to the owner device. The owner
> > > > > > > > > device starts write recording for the device and queries
> > > > > > > > > all the page addresses written by the device.
> > > > > > > > >
> > > > > > > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > > > > > > ---
> > > > > > > > > changelog:
> > > > > > > > > v1->v2:
> > > > > > > > > - addressed comments from Michael
> > > > > > > > > - replaced iova with physical address
> > > > > > > > > ---
> > > > > > > > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > > > b/admin-cmds-device-migration.tex index ed911e4..2e32f2c
> > > > > > > > > 100644
> > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > > > > The owner driver can discard any partially read or written
> > > > > > > > > device context when  any of the device migration flow
> > > > > > > > should be aborted.
> > > > > > > > >
> > > > > > > > > +During the device migration flow, a passthrough device
> > > > > > > > > +may write data to the guest virtual machine's memory, a
> > > > > > > > > +source hypervisor needs to keep track of these written
> > > > > > > > > +memory to migrate such memory to destination
> > > > > > > > hypervisor.
> > > > > > > > > +Some systems may not be able to keep track of such memory
> > > > > > > > > +write addresses at hypervisor level. In such a scenario,
> > > > > > > > > +a device records and reports these written memory
> > > > > > > > > +addresses to the owner device. The owner driver enables
> > > > > > > > > +write recording for one or more physical address ranges
> > > > > > > > > +per device during device
> > > > migration flow.
> > > > > > > > > +The owner driver periodically queries these written
> > > > > > > > > +physical address
> > > > > > records from the device.
> > > > > > > >
> > > > > > > > I wonder how PA works in this case. Device uses untranslated
> > > > > > > > requests so it can only see IOVA. We can't mandate ATS anyhow.
> > > > > > > Michael suggested to keep the language uniform as PA as this
> > > > > > > is ultimately
> > > > > > what the guest driver is supplying during vq creation and in
> > > > > > posting buffers as physical address.
> > > > > >
> > > > > > This seems to need some work. And, can you show me how it can work?
> > > > > >
> > > > > > 1) e.g if GAW is 48 bit, is the hypervisor expected to do a
> > > > > > bisection of the whole range?
> > > > > > 2) does the device need to reserve sufficient internal resources
> > > > > > for logging the dirty page and why (not)?
> > > > > No when dirty page logging starts, only at that time, device will
> > > > > reserve
> > > > enough resources.
> > > >
> > > > GAW is 48bit, how large would it have then?
> > > Dirty page tracking is not dependent on the size of the GAW.
> > > It is function of address ranges for the amount of guest memory regardless of
> > GAW.
> >
> > The problem is, e.g when vIOMMU is enabled, you can't know which IOVA is
> > actually used by guests. And even for the case when vIOMMU is not enabled,
> > the guest may have several TBs. Is it easy to reserve sufficient resources by the
> > device itself?
> >
> When page tracking is enabled per device, it knows about the range and it can reserve certain resource.

I didn't see such an interface in this series. Did I miss anything?

Btw, the IOVA is actually allocated by the guest, so how can we know the
range? (Or do we use the host range?)

>
> > Host should always have more resources than device, in that sense there could
> > be several methods that tries to utilize host memory instead of the one in the
> > device. I think we've discussed this when going through the doc prepared by
> > Eugenio.
> >
> > >
> > > > What happens if we're trying to migrate more than 1 device?
> > > >
> > > That is perfectly fine.
> > > Each device is updating its log of pages it wrote.
> > > The hypervisor is collecting their sum.
> >
> > See above.
> >
> > >
> > > > >
> > > > > > 3) DMA is part of the transport, it's natural to do logging
> > > > > > there, why duplicate efforts in the virtio layer?
> > > > > He he, you have funny comment.
> > > > > When an abstract facility is added to virtio you say to do in transport.
> > > >
> > > > So it's not done in the general facility but tied to the admin part.
> > > > And we all know dirty page tracking is a challenge and Eugenio has a
> > > > good summary of pros/cons. A revisit of those docs make me think
> > > > virtio is not the good place for doing that for may reasons:
> > > >
> > > > 1) as stated, platform will evolve to be able to tracking dirty
> > > > pages, actually, it has been supported by a lot of major IOMMU
> > > > vendors
> > >
> > > This is optional facility in virtio.
> > > Can you please point to the references? I don’t see it in the common Linux
> > kernel support for it.
> >
> > Note that when IOMMUFD is being proposed, dirty page tracking is one of the
> > major considerations.
> >
> > This is one recent proposal:
> >
> > https://www.spinics.net/lists/kvm/msg330894.html
> >
> Sure, so if platform supports it. it can be used from the platform.
> If it does not, the device supplies it.
>
> > > Instead Linux kernel choose to extend to the devices.
> >
> > Well, as I stated, tracking dirty pages is challenging if you want to do it on a
> > device, and you can't simply invent dirty page tracking for each type of the
> > devices.
> >
> It is not invented.
> It is generic framework for all virtio device types as proposed here.
> Keep in mind, that it is optional already in v3 series.
>
> > > At least not seen to arrive this in any near term in start of 2024 which is
> > where users must use this.
> > >
> > > > 2) you can't assume virtio is the only device that can be used by
> > > > the guest, having dirty pages tracking to be implemented in each
> > > > type of device is unrealistic
> > > Of course, there is no such assumption made. Where did you see a text that
> > made such assumption?
> >
> > So what happens if you have a guest with virtio and other devices assigned?
> >
> What happens? Each device type would do its own dirty page tracking.
> And if all devices does not have support, hypervisor knows to fall back to platform iommu or its own.
>
> > > Each virtio and non virtio devices who wants to report their dirty page report,
> > will do their way.
> > >
> > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > for sure, as platform will provide much rich features for logging
> > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > to compete with the features that will be provided by the platform
> > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > so that virtio TC can omit?
> >
> > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > on top of transport or platform. There's no need to duplicate their job.
> > Especially considering that virtio can't do better than them.
> >
> I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.

The RFC of the IOMMUFD support goes back to early 2022. Intel, AMD and
ARM all support it now.

> And the work seems to have started for some platforms.

Let me quote from the above link:

"""
Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
alongside VT-D rev3.x also do support.
"""

> Without such platform commitment, virtio also skipping it would not work.

Is the above sufficient? I'm a bit more familiar with VT-d; the
hardware feature has been there for years.

>
> > > i.e. in first year of 2024?
> >
> > Why does it matter in 2024?
> Because users needs to use it now.
>
> >
> > > If not, we are better off to offer this, and when/if platform support is, sure,
> > this feature can be disabled/not used/not enabled.
> > >
> > > > 4) if the platform support is missing, we can use software or
> > > > leverage transport for assistance like PRI
> > > All of these are in theory.
> > > Our experiment shows PRI performance is 21x slower than page fault rate
> > done by the cpu.
> > > It simply does not even pass a simple 10Gbps test.
> >
> > If you stick to the wire speed during migration, it can converge.
> Do you have perf data for this?

No, but it's not hard to imagine the worst case: a small program
that dirties every page via a NIC.

> In the internal tests we don’t see this happening.

downtime = dirty_rate * PAGE_SIZE / migration_speed

So if we get very high dirty rates (e.g. from a high speed NIC), we can't
satisfy the downtime requirement. Or, if you do see it converge, you might
be getting help from the auto-converge support in hypervisors like KVM,
which throttles the VCPUs, and then you can't reach wire speed.
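
Plugging illustrative (assumed, not measured) numbers into that formula
makes the point concrete: a dirty rate of 500,000 4KiB pages/s against a
25Gbps migration link already gives more than half a second of downtime.

#include <stdio.h>

int main(void)
{
        double dirty_rate      = 500000.0;      /* dirtied pages/s (assumed) */
        double page_size       = 4096.0;        /* bytes */
        double migration_speed = 25.0e9 / 8.0;  /* 25 Gbps link, bytes/s */

        printf("downtime ~= %.2f s\n",
               dirty_rate * page_size / migration_speed);  /* ~0.66 s */
        return 0;
}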

>
> >
> > > There is no requirement for mandating PRI either.
> > > So it is unusable.
> >
> > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > slow, PCI can evolve for sure.
> You should try.

Not my duty; I just want to make sure things are done in the correct
layer, and that once it needs to be done in virtio, there is nothing
obviously wrong.

> In the current state, it is mandating.
> And if you think PRI is the only way,

I don't; it's just an example of where virtio can leverage either the
transport or the platform. Or, if it's a fault in virtio that slows down
PRI, then that is something we can fix.

>  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?

No, the point is to not duplicate work, especially considering virtio
can't do better than the platform or transport.

>
> >
> > >
> > > >
> > > > > When one does something in transport, you say, this is transport
> > > > > specific, do
> > > > some generic.
> > > > >
> > > > > Here the device is being tracked is virtio device.
> > > > > PCI-SIG has told already that PCIM interface is outside the scope of it.
> > > > > Hence, this is done in virtio layer here in abstract way.
> > > >
> > > > You will end up with a competition with the platform/transport one
> > > > that will fail.
> > > >
> > > I don’t see a reason. There is no competition.
> > > Platform always have a choice to not use device side page tracking when it is
> > supported.
> >
> > Platform provides a lot of other functionalities for dirty logging:
> > e.g per PASID, granular, etc. So you want to duplicate them again in the virtio? If
> > not, why choose this way?
> >
> It is optional for the platforms where platform do not have it.

We are developing new virtio functionalities that are targeted for
future platforms. Otherwise we would end up with a feature with a very
narrow use case.

Thanks



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-07  4:04                   ` [virtio-comment] " Jason Wang
@ 2023-11-07  7:05                     ` Michael S. Tsirkin
  2023-11-08  4:28                       ` Jason Wang
  2023-11-09  6:24                     ` Parav Pandit
  1 sibling, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07  7:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > will do their way.
> > > >
> > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > for sure, as platform will provide much rich features for logging
> > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > to compete with the features that will be provided by the platform
> > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > so that virtio TC can omit?
> > >
> > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > on top of transport or platform. There's no need to duplicate their job.
> > > Especially considering that virtio can't do better than them.
> > >
> > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> 
> The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> ARM are all supporting that now.
> 
> > And the work seems to have started for some platforms.
> 
> Let me quote from the above link:
> 
> """
> Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> alongside VT-D rev3.x also do support.
> """
> 
> > Without such platform commitment, virtio also skipping it would not work.
> 
> Is the above sufficient? I'm a little bit more familiar with vtd, the
> hw feature has been there for years.


Repeating myself - I'm not sure that will work well for all workloads. Definitely KVM did
not scan PTEs. It used page faults with a bit per page, and later, as VM size
grew, switched to PML.  This interface is analogous to PML; what Lingshan
proposed is analogous to a bit per page - the problem, unfortunately, is
that you can't easily set a bit by DMA.

So I think this dirty tracking is a good option to have.



> >
> > > > i.e. in first year of 2024?
> > >
> > > Why does it matter in 2024?
> > Because users needs to use it now.
> >
> > >
> > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > this feature can be disabled/not used/not enabled.
> > > >
> > > > > 4) if the platform support is missing, we can use software or
> > > > > leverage transport for assistance like PRI
> > > > All of these are in theory.
> > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > done by the cpu.
> > > > It simply does not even pass a simple 10Gbps test.
> > >
> > > If you stick to the wire speed during migration, it can converge.
> > Do you have perf data for this?
> 
> No, but it's not hard to imagine the worst case. Wrote a small program
> that dirty every page by a NIC.
> 
> > In the internal tests we don’t see this happening.
> 
> downtime = dirty_rates * PAGE_SIZE / migration_speed
> 
> So if we get very high dirty rates (e.g by a high speed NIC), we can't
> satisfy the requirement of the downtime. Or if you see the converge,
> you might get help from the auto converge support by the hypervisors
> like KVM where it tries to throttle the VCPU then you can't reach the
> wire speed.

Will only work for some device types.



> >
> > >
> > > > There is no requirement for mandating PRI either.
> > > > So it is unusable.
> > >
> > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > slow, PCI can evolve for sure.
> > You should try.
> 
> Not my duty, I just want to make sure things are done in the correct
> layer, and once it needs to be done in the virtio, there's nothing
> obviously wrong.

Yea but just vague questions don't help to make sure either way.


> > In the current state, it is mandating.
> > And if you think PRI is the only way,
> 
> I don't, it's just an example where virtio can leverage from either
> transport or platform. Or if it's the fault in virtio that slows down
> the PRI, then it is something we can do.
> 
> >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> 
> No, the point is to not duplicate works especially considering virtio
> can't do better than platform or transport.

If someone says they tried and platform's migration support does not
work for them and they want to build a solution in virtio then
what exactly is the objection? virtio is here in the
first place because emulating devices didn't work well.

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-07  7:05                     ` Michael S. Tsirkin
@ 2023-11-08  4:28                       ` Jason Wang
  2023-11-08  8:17                         ` Michael S. Tsirkin
  2023-11-09  6:26                         ` [virtio-comment] " Parav Pandit
  0 siblings, 2 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-08  4:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > > will do their way.
> > > > >
> > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > for sure, as platform will provide much rich features for logging
> > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > to compete with the features that will be provided by the platform
> > > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > > so that virtio TC can omit?
> > > >
> > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > > on top of transport or platform. There's no need to duplicate their job.
> > > > Especially considering that virtio can't do better than them.
> > > >
> > > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> >
> > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > ARM are all supporting that now.
> >
> > > And the work seems to have started for some platforms.
> >
> > Let me quote from the above link:
> >
> > """
> > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > alongside VT-D rev3.x also do support.
> > """
> >
> > > Without such platform commitment, virtio also skipping it would not work.
> >
> > Is the above sufficient? I'm a little bit more familiar with vtd, the
> > hw feature has been there for years.
>
>
> Repeating myself - I'm not sure that will work well for all workloads.

I think this comment applies to this proposal as well.

> Definitely KVM did
> not scan PTEs. It used pagefaults with bit per page and later as VM size
> grew switched to PLM.  This interface is analogous to PLM,

I think you meant PML actually. And it doesn't work like PML. To
behave like PML it needs to

1) organize the log buffers as a queue with indices
2) suspend the device (as a #vmexit in PML) if it runs out of the buffers
3) send a notification to the driver if it runs out of the buffers

I don't see any of the above in this proposal. If we do that it would
be less problematic than what is being proposed here.
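
As a minimal sketch of what those three properties could look like
together (the structure and names below are hypothetical, not what the
series defines):

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical PML-style log: an index-based queue, the device suspends
 * when the queue is full, and it notifies the driver so the driver can
 * drain the entries and resume the device. */
struct dirty_log {
        uint64_t entries[512];
        uint32_t head;                  /* producer index, written by device */
        uint32_t tail;                  /* consumer index, written by driver */
        bool     device_suspended;
};

static void notify_driver(struct dirty_log *log)
{
        (void)log;                      /* e.g. inject an interrupt */
}

/* Device side: record one written page address. */
static bool device_record_write(struct dirty_log *log, uint64_t page_addr)
{
        if (log->head - log->tail == 512) {
                /* 2) queue full: suspend DMA (analogous to a PML-full
                 * #vmexit) and 3) notify the driver; the write is retried
                 * after resume, so no address is lost. */
                log->device_suspended = true;
                notify_driver(log);
                return false;
        }
        log->entries[log->head % 512] = page_addr;   /* 1) queue with indices */
        log->head++;
        return true;
}

/* Driver side: drain the queue and let the device continue. */
static void driver_drain(struct dirty_log *log, void (*mark_dirty)(uint64_t))
{
        while (log->tail != log->head)
                mark_dirty(log->entries[log->tail++ % 512]);
        log->device_suspended = false;
}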

Even if we manage to do that, it doesn't mean we won't have issues.

1) For many reasons it can neither see nor log via GPA, so this
requires a traversal of the vIOMMU mapping tables by the hypervisor
afterwards; that would be expensive and would need synchronization with
the guest's modification of the IO page table, which looks very hard.
2) There are a lot of special or reserved IOVA ranges (for example the
interrupt areas in x86) that need special care, which is architectural
and beyond the scope or knowledge of the virtio device but within that
of the platform IOMMU. Things would be more complicated when SVA is
enabled. And there could be other architecture-specific knowledge (e.g.
PAGE_SIZE) that might be needed. There's no easy way to deal with
those cases.

We wouldn't need to care about all of them if it is done at platform
IOMMU level.

> what Lingshan
> proposed is analogous to bit per page - problem unfortunately is
> you can't easily set a bit by DMA.
>

I'm not saying bit/bytemap is the best, but it has been used by real
hardware. And we have many other options.

> So I think this dirty tracking is a good option to have.
>
>
>
> > >
> > > > > i.e. in first year of 2024?
> > > >
> > > > Why does it matter in 2024?
> > > Because users needs to use it now.
> > >
> > > >
> > > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > > this feature can be disabled/not used/not enabled.
> > > > >
> > > > > > 4) if the platform support is missing, we can use software or
> > > > > > leverage transport for assistance like PRI
> > > > > All of these are in theory.
> > > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > > done by the cpu.
> > > > > It simply does not even pass a simple 10Gbps test.
> > > >
> > > > If you stick to the wire speed during migration, it can converge.
> > > Do you have perf data for this?
> >
> > No, but it's not hard to imagine the worst case. Wrote a small program
> > that dirty every page by a NIC.
> >
> > > In the internal tests we don’t see this happening.
> >
> > downtime = dirty_rates * PAGE_SIZE / migration_speed
> >
> > So if we get very high dirty rates (e.g by a high speed NIC), we can't
> > satisfy the requirement of the downtime. Or if you see the converge,
> > you might get help from the auto converge support by the hypervisors
> > like KVM where it tries to throttle the VCPU then you can't reach the
> > wire speed.
>
> Will only work for some device types.
>

Yes, that's the point. Parav said he doesn't see the issue, it's
probably because he is testing a virtio-net and so the vCPU is
automatically throttled. It doesn't mean it can work for other virtio
devices.

>
>
> > >
> > > >
> > > > > There is no requirement for mandating PRI either.
> > > > > So it is unusable.
> > > >
> > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > slow, PCI can evolve for sure.
> > > You should try.
> >
> > Not my duty, I just want to make sure things are done in the correct
> > layer, and once it needs to be done in the virtio, there's nothing
> > obviously wrong.
>
> Yea but just vague questions don't help to make sure eiter way.

I don't think it's vague, I have explained: if something in the virtio
slows down the PRI, we can try to fix it. Missing functions in the
platform or transport are not a good excuse to try to work around them
in virtio. It's a layer violation and we never had any feature like
this in the past.

>
> > > In the current state, it is mandating.
> > > And if you think PRI is the only way,
> >
> > I don't, it's just an example where virtio can leverage from either
> > transport or platform. Or if it's the fault in virtio that slows down
> > the PRI, then it is something we can do.
> >
> > >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> >
> > No, the point is to not duplicate works especially considering virtio
> > can't do better than platform or transport.
>
> If someone says they tried and platform's migration support does not
> work for them and they want to build a solution in virtio then
> what exactly is the objection?

The discussion is to figure out whether virtio can do this easily and
correctly, then we can have a conclusion. I've stated some issues
above, and I've asked other questions related to them which are still
not answered.

I think we had a very hard time bypassing the IOMMU in the past, and
that is something we don't want to repeat.

We've gone through several methods of logging dirty pages in the past
(each with pros/cons), but this proposal never explains why it chooses
one of them but not the others. The spec needs to find the best path
instead of just a possible path without any rationale for why.

> virtio is here in the
> first place because emulating devices didn't work well.

I don't understand here. We have supported emulated devices for years.
I'm pretty sure a lot of issues could be uncovered if this proposal
can be prototyped with an emulated device first.

Thanks





>
> --
> MST
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-08  4:28                       ` Jason Wang
@ 2023-11-08  8:17                         ` Michael S. Tsirkin
  2023-11-08  9:00                           ` [virtio-comment] " Parav Pandit
  2023-11-09  3:31                           ` Jason Wang
  2023-11-09  6:26                         ` [virtio-comment] " Parav Pandit
  1 sibling, 2 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08  8:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > > > will do their way.
> > > > > >
> > > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > > for sure, as platform will provide much rich features for logging
> > > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > > to compete with the features that will be provided by the platform
> > > > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > > > so that virtio TC can omit?
> > > > >
> > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > > > on top of transport or platform. There's no need to duplicate their job.
> > > > > Especially considering that virtio can't do better than them.
> > > > >
> > > > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> > >
> > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > ARM are all supporting that now.
> > >
> > > > And the work seems to have started for some platforms.
> > >
> > > Let me quote from the above link:
> > >
> > > """
> > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > alongside VT-D rev3.x also do support.
> > > """
> > >
> > > > Without such platform commitment, virtio also skipping it would not work.
> > >
> > > Is the above sufficient? I'm a little bit more familiar with vtd, the
> > > hw feature has been there for years.
> >
> >
> > Repeating myself - I'm not sure that will work well for all workloads.
> 
> I think this comment applies to this proposal as well.

Yes - some systems might be better off with platform tracking.
And I think supporting shadow vq better would be nice too.

> > Definitely KVM did
> > not scan PTEs. It used pagefaults with bit per page and later as VM size
> > grew switched to PLM.  This interface is analogous to PLM,
> 
> I think you meant PML actually. And it doesn't work like PML. To
> behave like PML it needs to
> 
> 1) log buffers were organized as a queue with indices
> 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> 3) device need to send a notification to the driver if it runs out of the buffer
> 
> I don't see any of the above in this proposal. If we do that it would
> be less problematic than what is being proposed here.

What is common between this and PML is that you get the addresses
directly without scanning megabytes of bitmaps or worse -
hundreds of megabytes of page tables.

The data structure is different but I don't see why it is critical.

I agree that I don't see out-of-buffers notifications either, which
implies the device has to maintain something like a bitmap internally.
Which I guess could be fine, but it is not clear to me how large that
bitmap has to be. How does the device know? Needs to be addressed.
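
As a back-of-the-envelope illustration of that sizing question (assuming,
purely for the example, that the internal structure is a plain
one-bit-per-page map over the tracked range):

#include <stdio.h>
#include <stdint.h>

/* One bit per page of the tracked range. */
static uint64_t bitmap_bytes(uint64_t range_bytes, uint64_t page_size)
{
        uint64_t pages = range_bytes / page_size;

        return (pages + 7) / 8;
}

int main(void)
{
        /* A 1 TiB tracked range with 4 KiB pages needs a 32 MiB bitmap;
         * with 2 MiB pages it shrinks to 64 KiB. */
        printf("%llu\n", (unsigned long long)bitmap_bytes(1ull << 40, 4096));
        printf("%llu\n", (unsigned long long)bitmap_bytes(1ull << 40, 2ull << 20));
        return 0;
}

How the device learns the range (and hence the size) up front is exactly
the open point raised here.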


> Even if we manage to do that, it doesn't mean we won't have issues.
> 
> 1) For many reasons it can neither see nor log via GPA, so this
> requires a traversal of the vIOMMU mapping tables by the hypervisor
> afterwards, it would be expensive and need synchronization with the
> guest modification of the IO page table which looks very hard.

vIOMMU is fast enough to be used on the data path but not fast enough for
dirty tracking? Hard to believe.  If that is true and you want to speed up
the vIOMMU then you implement an efficient data structure for that.

> 2) There are a lot of special or reserved IOVA ranges (for example the
> interrupt areas in x86) that need special care which is architectural
> and where it is beyond the scope or knowledge of the virtio device but
> the platform IOMMU. Things would be more complicated when SVA is
> enabled.

SVA being what here?

> And there could be other architecte specific knowledge (e.g
> PAGE_SIZE) that might be needed. There's no easy way to deal with
> those cases.

Good point about page size actually - using 4k unconditionally
is a waste of resources.


> We wouldn't need to care about all of them if it is done at platform
> IOMMU level.

If someone logs at IOMMU level then nothing needs to be done
in the spec at all. This is about capability at the device level.


> > what Lingshan
> > proposed is analogous to bit per page - problem unfortunately is
> > you can't easily set a bit by DMA.
> >
> 
> I'm not saying bit/bytemap is the best, but it has been used by real
> hardware. And we have many other options.
> 
> > So I think this dirty tracking is a good option to have.
> >
> >
> >
> > > >
> > > > > > i.e. in first year of 2024?
> > > > >
> > > > > Why does it matter in 2024?
> > > > Because users needs to use it now.
> > > >
> > > > >
> > > > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > > > this feature can be disabled/not used/not enabled.
> > > > > >
> > > > > > > 4) if the platform support is missing, we can use software or
> > > > > > > leverage transport for assistance like PRI
> > > > > > All of these are in theory.
> > > > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > > > done by the cpu.
> > > > > > It simply does not even pass a simple 10Gbps test.
> > > > >
> > > > > If you stick to the wire speed during migration, it can converge.
> > > > Do you have perf data for this?
> > >
> > > No, but it's not hard to imagine the worst case. Wrote a small program
> > > that dirty every page by a NIC.
> > >
> > > > In the internal tests we don’t see this happening.
> > >
> > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > >
> > > So if we get very high dirty rates (e.g by a high speed NIC), we can't
> > > satisfy the requirement of the downtime. Or if you see the converge,
> > > you might get help from the auto converge support by the hypervisors
> > > like KVM where it tries to throttle the VCPU then you can't reach the
> > > wire speed.
> >
> > Will only work for some device types.
> >
> 
> Yes, that's the point. Parav said he doesn't see the issue, it's
> probably because he is testing a virtio-net and so the vCPU is
> automatically throttled. It doesn't mean it can work for other virito
> devices.

Only for TX, and I'm pretty sure they had the foresight to test RX not
just TX but let's confirm. Parav did you test both directions?

> >
> >
> > > >
> > > > >
> > > > > > There is no requirement for mandating PRI either.
> > > > > > So it is unusable.
> > > > >
> > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > > slow, PCI can evolve for sure.
> > > > You should try.
> > >
> > > Not my duty, I just want to make sure things are done in the correct
> > > layer, and once it needs to be done in the virtio, there's nothing
> > > obviously wrong.
> >
> > Yea but just vague questions don't help to make sure eiter way.
> 
> I don't think it's vague, I have explained, if something in the virito
> slows down the PRI, we can try to fix them.

I don't believe you are going to make PRI fast. No one managed so far.

> Missing functions in
> platform or transport is not a good excuse to try to workaround it in
> the virtio. It's a layer violation and we never had any feature like
> this in the past.

Yes missing functionality in the platform is exactly why virtio
was born in the first place.

> >
> > > > In the current state, it is mandating.
> > > > And if you think PRI is the only way,
> > >
> > > I don't, it's just an example where virtio can leverage from either
> > > transport or platform. Or if it's the fault in virtio that slows down
> > > the PRI, then it is something we can do.
> > >
> > > >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > >
> > > No, the point is to not duplicate works especially considering virtio
> > > can't do better than platform or transport.
> >
> > If someone says they tried and platform's migration support does not
> > work for them and they want to build a solution in virtio then
> > what exactly is the objection?
> 
> The discussion is to make sure whether virtio can do this easily and
> correctly, then we can have a conclusion. I've stated some issues
> above, and I've asked other questions related to them which are still
> not answered.
> 
> I think we had a very hard time in bypassing IOMMU in the past that we
> don't want to repeat.
> 
> We've gone through several methods of logging dirty pages in the past
> (each with pros/cons), but this proposal never explains why it chooses
> one of them but not others. Spec needs to find the best path instead
> of just a possible path without any rationale about why.

Adding more rationale isn't a bad thing.
In particular, if the platform supplies dirty tracking, then how does
the driver decide which to use, the platform or the device capability?
A bit of discussion around this is a good idea.


> > virtio is here in the
> > first place because emulating devices didn't work well.
> 
> I don't understand here. We have supported emulated devices for years.
> I'm pretty sure a lot of issues could be uncovered if this proposal
> can be prototyped with an emulated device first.
> 
> Thanks

virtio was originally PV as opposed to emulation. That there's now
hardware virtio and you call software implementation "an emulation" is
very meta.


> 
> 
> 
> 
> >
> > --
> > MST
> >




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-08  8:17                         ` Michael S. Tsirkin
@ 2023-11-08  9:00                           ` Parav Pandit
  2023-11-08 17:16                             ` [virtio-comment] " Michael S. Tsirkin
  2023-11-09  3:31                           ` Jason Wang
  1 sibling, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-08  9:00 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment, cohuck, sburla, Shahaf Shuler, Maor Gottlieb,
	Yishai Hadas, lingshan.zhu

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 8, 2023 1:47 PM

> Only for TX, and I'm pretty sure they had the foresight to test RX not just TX but
> let's confirm. Parav did you test both directions?
Rx is the main part to test for the dirty page tracking.
Tx only exercises ring updates, so it is not a very interesting test.

When page tracking is started, there is an impact on Rx as a natural throttle by slowing down the rx packet rate.
And devices have an implementation choice covering the spectrum of how much to drop.
And as implementations evolve, it will improve as well.

PRI for sure is out of the question. Most workloads won't use it at all given other limitations of virtio on the PCI front and existing PRI behaviors.

There is also a test with platform IOMMU dirty page tracking, but it is at too early a stage to comment on in this forum.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-08  9:00                           ` [virtio-comment] " Parav Pandit
@ 2023-11-08 17:16                             ` Michael S. Tsirkin
  2023-11-09  6:27                               ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08 17:16 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 08, 2023 at 09:00:00AM +0000, Parav Pandit wrote:
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, November 8, 2023 1:47 PM
> 
> > Only for TX, and I'm pretty sure they had the foresight to test RX not just TX but
> > let's confirm. Parav did you test both directions?
> Rx is the main part to test for the dirty page tracking.
> Tx only is around ring updates, so not very interesting test.
> 
> When page tracking is started, there is impact on Rx as natural throttle by slowing down rx packet rate.

Wait a sec, sounds kind of like what we see with SVQ as well?

> And devices have implementation choice to covering the spectrum on amount of drop rate.

What does this mean practically?

> And has implementations evolve, it will improve as well.

Maybe, or maybe not.

> PRI for sure is out of question. Most workload won't use it at all given other limitations of virtio on PCI front and existing PRI behaviors.

Out of curiosity, what do you refer to?

> There is also a test with platform IOMMU dirty page tracking done as well but it is very early stage to comment in this forum.

Interesting. That's a very important test I'd say - if platform based
dirty page tracking is just as good we can avoid adding it to virtio.


And I'm rather confused at this point - I was under the impression you
already have a prototype which shows negligible performance impact
with on-device tracking.

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-08  8:17                         ` Michael S. Tsirkin
  2023-11-08  9:00                           ` [virtio-comment] " Parav Pandit
@ 2023-11-09  3:31                           ` Jason Wang
  2023-11-09  7:59                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-09  3:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > > > > will do their way.
> > > > > > >
> > > > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > > > for sure, as platform will provide much rich features for logging
> > > > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > > > to compete with the features that will be provided by the platform
> > > > > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > > > > so that virtio TC can omit?
> > > > > >
> > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > > > > on top of transport or platform. There's no need to duplicate their job.
> > > > > > Especially considering that virtio can't do better than them.
> > > > > >
> > > > > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> > > >
> > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > > ARM are all supporting that now.
> > > >
> > > > > And the work seems to have started for some platforms.
> > > >
> > > > Let me quote from the above link:
> > > >
> > > > """
> > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > alongside VT-D rev3.x also do support.
> > > > """
> > > >
> > > > > Without such platform commitment, virtio also skipping it would not work.
> > > >
> > > > Is the above sufficient? I'm a little bit more familiar with vtd, the
> > > > hw feature has been there for years.
> > >
> > >
> > > Repeating myself - I'm not sure that will work well for all workloads.
> >
> > I think this comment applies to this proposal as well.
>
> Yes - some systems might be better off with platform tracking.
> And I think supporting shadow vq better would be nice too.

For shadow vq, did you mean the work that is done by Eugenio?

>
> > > Definitely KVM did
> > > not scan PTEs. It used pagefaults with bit per page and later as VM size
> > > grew switched to PLM.  This interface is analogous to PLM,
> >
> > I think you meant PML actually. And it doesn't work like PML. To
> > behave like PML it needs to
> >
> > 1) log buffers were organized as a queue with indices
> > 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> > 3) device need to send a notification to the driver if it runs out of the buffer
> >
> > I don't see any of the above in this proposal. If we do that it would
> > be less problematic than what is being proposed here.
>
> What is common between this and PML is that you get the addresses
> directly without scanning megabytes of bitmaps or worse -
> hundreds of megabytes of page tables.

Yes, it has overhead but this is the method we use for vhost and KVM (earlier).

To me the important advantage of PML is that it uses limited
resources on the host which

1) doesn't require resources in the device
2) doesn't grow as the guest memory increases (but this advantage
exists in neither this proposal nor the bitmap approach)

>
> The data structure is different but I don't see why it is critical.
>
> I agree that I don't see out of buffers notifications too which implies
> device has to maintain something like a bitmap internally.  Which I
> guess could be fine but it is not clear to me how large that bitmap has
> to be. How does the device know? Needs to be addressed.

This is the question I asked Parav in another thread. Using host
memory as a queue with notification (like PML) might be much better.

>
>
> > Even if we manage to do that, it doesn't mean we won't have issues.
> >
> > 1) For many reasons it can neither see nor log via GPA, so this
> > requires a traversal of the vIOMMU mapping tables by the hypervisor
> > afterwards, it would be expensive and need synchronization with the
> > guest modification of the IO page table which looks very hard.
>
> vIOMMU is fast enough to be used on data path but not fast enough for
> dirty tracking?

We set up SPTEs or use nesting offloading, where the PTEs can be
iterated by hardware directly, which is fast.

This is not the case here, where software needs to iterate the IO page
tables in the guest, which could be slow.

> Hard to believe.  If true and you want to speed up
> vIOMMU then you implement an efficient datastructure for that.

Besides the issue of performance, it's also racy, assuming we are logging IOVA.

0) device logs an IOVA
1) hypervisor fetches the IOVA from the log buffer
2) guest maps the IOVA to a new GPA
3) hypervisor traverses the guest table and resolves the IOVA to the new GPA

Then we lose the old GPA.
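
To make the interleaving concrete, here is a self-contained toy sketch of
the sequence above; every value and helper is made up for illustration:

#include <stdint.h>
#include <stdio.h>

/* Toy single-entry vIOMMU mapping: one IOVA and the GPA it currently maps to. */
static uint64_t cur_gpa = 0xaaaa0000;   /* gpa_old, valid when the DMA happened */

static uint64_t viommu_translate(uint64_t iova)
{
        (void)iova;
        return cur_gpa;
}

static void mark_gpa_dirty(uint64_t gpa)
{
        printf("dirty GPA 0x%llx\n", (unsigned long long)gpa);
}

int main(void)
{
        uint64_t logged_iova = 0x1000;

        /* 0) the device writes through IOVA 0x1000 and logs it; the data
         *    landed in gpa_old */
        /* 2) the guest remaps IOVA 0x1000 to a new GPA before the log is
         *    drained */
        cur_gpa = 0xbbbb0000;           /* gpa_new */

        /* 1) + 3) the hypervisor now drains the log and translates the IOVA */
        mark_gpa_dirty(viommu_translate(logged_iova));

        /* gpa_old (0xaaaa0000), the page the device actually wrote, is never
         * marked dirty, so the migration silently misses it. */
        return 0;
}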

>
> > 2) There are a lot of special or reserved IOVA ranges (for example the
> > interrupt areas in x86) that need special care which is architectural
> > and where it is beyond the scope or knowledge of the virtio device but
> > the platform IOMMU. Things would be more complicated when SVA is
> > enabled.
>
> SVA being what here?

For example, the IOMMU may treat interrupt ranges differently depending on
whether SVA is enabled or not. It's very hard and unnecessary to teach
devices about this.

>
> > And there could be other architecte specific knowledge (e.g
> > PAGE_SIZE) that might be needed. There's no easy way to deal with
> > those cases.
>
> Good point about page size actually - using 4k unconditionally
> is a waste of resources.

Actually, it is more than just PAGE_SIZE; for example, PASID and others.

>
>
> > We wouldn't need to care about all of them if it is done at platform
> > IOMMU level.
>
> If someone logs at IOMMU level then nothing needs to be done
> in the spec at all. This is about capability at the device level.

True, but my question is whether or not it can be done at the device level easily.

>
>
> > > what Lingshan
> > > proposed is analogous to bit per page - problem unfortunately is
> > > you can't easily set a bit by DMA.
> > >
> >
> > I'm not saying bit/bytemap is the best, but it has been used by real
> > hardware. And we have many other options.
> >
> > > So I think this dirty tracking is a good option to have.
> > >
> > >
> > >
> > > > >
> > > > > > > i.e. in first year of 2024?
> > > > > >
> > > > > > Why does it matter in 2024?
> > > > > Because users needs to use it now.
> > > > >
> > > > > >
> > > > > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > > > > this feature can be disabled/not used/not enabled.
> > > > > > >
> > > > > > > > 4) if the platform support is missing, we can use software or
> > > > > > > > leverage transport for assistance like PRI
> > > > > > > All of these are in theory.
> > > > > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > > > > done by the cpu.
> > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > >
> > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > Do you have perf data for this?
> > > >
> > > > No, but it's not hard to imagine the worst case. Wrote a small program
> > > > that dirty every page by a NIC.
> > > >
> > > > > In the internal tests we don’t see this happening.
> > > >
> > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > >
> > > > So if we get very high dirty rates (e.g by a high speed NIC), we can't
> > > > satisfy the requirement of the downtime. Or if you see the converge,
> > > > you might get help from the auto converge support by the hypervisors
> > > > like KVM where it tries to throttle the VCPU then you can't reach the
> > > > wire speed.
> > >
> > > Will only work for some device types.
> > >
> >
> > Yes, that's the point. Parav said he doesn't see the issue, it's
> > probably because he is testing a virtio-net and so the vCPU is
> > automatically throttled. It doesn't mean it can work for other virito
> > devices.
>
> Only for TX, and I'm pretty sure they had the foresight to test RX not
> just TX but let's confirm. Parav did you test both directions?

RX speed somehow depends on the speed of refill, so throttling helps
more or less.

>
> > >
> > >
> > > > >
> > > > > >
> > > > > > > There is no requirement for mandating PRI either.
> > > > > > > So it is unusable.
> > > > > >
> > > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > > > slow, PCI can evolve for sure.
> > > > > You should try.
> > > >
> > > > Not my duty, I just want to make sure things are done in the correct
> > > > layer, and once it needs to be done in the virtio, there's nothing
> > > > obviously wrong.
> > >
> > > Yea but just vague questions don't help to make sure eiter way.
> >
> > I don't think it's vague, I have explained, if something in the virito
> > slows down the PRI, we can try to fix them.
>
> I don't believe you are going to make PRI fast. No one managed so far.

So it's the fault of PRI, not virtio, but it doesn't mean we need to do
it in virtio.

>
> > Missing functions in
> > platform or transport is not a good excuse to try to workaround it in
> > the virtio. It's a layer violation and we never had any feature like
> > this in the past.
>
> Yes missing functionality in the platform is exactly why virtio
> was born in the first place.

Well, the platform can't do device-specific logic. But that's not the
case for dirty page tracking, which is agnostic to the device logic.

>
> > >
> > > > > In the current state, it is mandating.
> > > > > And if you think PRI is the only way,
> > > >
> > > > I don't, it's just an example where virtio can leverage from either
> > > > transport or platform. Or if it's the fault in virtio that slows down
> > > > the PRI, then it is something we can do.
> > > >
> > > > >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > >
> > > > No, the point is to not duplicate works especially considering virtio
> > > > can't do better than platform or transport.
> > >
> > > If someone says they tried and platform's migration support does not
> > > work for them and they want to build a solution in virtio then
> > > what exactly is the objection?
> >
> > The discussion is to make sure whether virtio can do this easily and
> > correctly, then we can have a conclusion. I've stated some issues
> > above, and I've asked other questions related to them which are still
> > not answered.
> >
> > I think we had a very hard time in bypassing IOMMU in the past that we
> > don't want to repeat.
> >
> > We've gone through several methods of logging dirty pages in the past
> > (each with pros/cons), but this proposal never explains why it chooses
> > one of them but not others. Spec needs to find the best path instead
> > of just a possible path without any rationale about why.
>
> Adding more rationale isn't a bad thing.
> In particular if platform supplies dirty tracking then how does
> driver decide which to use platform or device capability?
> A bit of discussion around this is a good idea.
>
>
> > > virtio is here in the
> > > first place because emulating devices didn't work well.
> >
> > I don't understand here. We have supported emulated devices for years.
> > I'm pretty sure a lot of issues could be uncovered if this proposal
> > can be prototyped with an emulated device first.
> >
> > Thanks
>
> virtio was originally PV as opposed to emulation. That there's now
> hardware virtio and you call software implementation "an emulation" is
> very meta.

Yes but I don't see how it relates to dirty page tracking. When we
find a way it should work for both software and hardware devices.

Thanks

>
>
> >
> >
> >
> >
> > >
> > > --
> > > MST
> > >
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-07  4:04                   ` [virtio-comment] " Jason Wang
  2023-11-07  7:05                     ` Michael S. Tsirkin
@ 2023-11-09  6:24                     ` Parav Pandit
  2023-11-13  3:37                       ` [virtio-comment] " Jason Wang
  2023-11-15  7:58                       ` Michael S. Tsirkin
  1 sibling, 2 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-09  6:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 7, 2023 9:34 AM
> 
> On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 6, 2023 12:04 PM
> > >
> > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > >
> > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > >
> > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > >
> > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > During a device migration flow (typically in a precopy
> > > > > > > > > > phase of the live migration), a device may write to
> > > > > > > > > > the guest memory. Some iommu/hypervisor may not be
> > > > > > > > > > able to track these
> > > > > written pages.
> > > > > > > > > > These pages to be migrated from source to destination
> hypervisor.
> > > > > > > > > >
> > > > > > > > > > A device which writes to these pages, provides the
> > > > > > > > > > page address record of the to the owner device. The
> > > > > > > > > > owner device starts write recording for the device and
> > > > > > > > > > queries all the page addresses written by the device.
> > > > > > > > > >
> > > > > > > > > > Fixes:
> > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > > > > > > > ---
> > > > > > > > > > changelog:
> > > > > > > > > > v1->v2:
> > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > ---
> > > > > > > > > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > > > > b/admin-cmds-device-migration.tex index
> > > > > > > > > > ed911e4..2e32f2c
> > > > > > > > > > 100644
> > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > > Migration}\label{sec:Basic Facilities of a Virtio
> > > > > > > > > > Device / The owner driver can discard any partially
> > > > > > > > > > read or written device context when  any of the device
> > > > > > > > > > migration flow
> > > > > > > > > should be aborted.
> > > > > > > > > >
> > > > > > > > > > +During the device migration flow, a passthrough
> > > > > > > > > > +device may write data to the guest virtual machine's
> > > > > > > > > > +memory, a source hypervisor needs to keep track of
> > > > > > > > > > +these written memory to migrate such memory to
> > > > > > > > > > +destination
> > > > > > > > > hypervisor.
> > > > > > > > > > +Some systems may not be able to keep track of such
> > > > > > > > > > +memory write addresses at hypervisor level. In such a
> > > > > > > > > > +scenario, a device records and reports these written
> > > > > > > > > > +memory addresses to the owner device. The owner
> > > > > > > > > > +driver enables write recording for one or more
> > > > > > > > > > +physical address ranges per device during device
> > > > > migration flow.
> > > > > > > > > > +The owner driver periodically queries these written
> > > > > > > > > > +physical address
> > > > > > > records from the device.
> > > > > > > > >
> > > > > > > > > I wonder how PA works in this case. Device uses
> > > > > > > > > untranslated requests so it can only see IOVA. We can't mandate
> ATS anyhow.
> > > > > > > > Michael suggested to keep the language uniform as PA as
> > > > > > > > this is ultimately
> > > > > > > what the guest driver is supplying during vq creation and in
> > > > > > > posting buffers as physical address.
> > > > > > >
> > > > > > > This seems to need some work. And, can you show me how it can
> work?
> > > > > > >
> > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor expected to do a
> > > > > > > bisection of the whole range?
> > > > > > > 2) does the device need to reserve sufficient internal
> > > > > > > resources for logging the dirty page and why (not)?
> > > > > > No when dirty page logging starts, only at that time, device
> > > > > > will reserve
> > > > > enough resources.
> > > > >
> > > > > GAW is 48bit, how large would it have then?
> > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > It is function of address ranges for the amount of guest memory
> > > > regardless of
> > > GAW.
> > >
> > > The problem is, e.g when vIOMMU is enabled, you can't know which
> > > IOVA is actually used by guests. And even for the case when vIOMMU
> > > is not enabled, the guest may have several TBs. Is it easy to
> > > reserve sufficient resources by the device itself?
> > >
> > When page tracking is enabled per device, it knows about the range and it can
> reserve certain resource.
> 
> I didn't see such an interface in this series. Anything I miss?
> 
Yes, this patch and the next patch cover the page tracking start, stop and query commands.
They are named the write recording commands.

> Btw, the IOVA is allocated by the guest actually, how can we know the range?
> (or using the host range?)
> 
The hypervisor would have the mapping translation.
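
As a rough sketch of that idea, with entirely hypothetical structures: the
hypervisor can derive the ranges it asks the device to track from the
translations it programmed for that device, since it created every one of
those mappings itself, and the same table lets it translate the reported
addresses back afterwards.

#include <stdint.h>
#include <stddef.h>

/* One mapping the hypervisor installed in the IOMMU for this device. */
struct hv_mapping {
        uint64_t iova;
        uint64_t gpa;
        uint64_t len;
};

/* One range handed to the (hypothetical) write-record start command. */
struct track_range {
        uint64_t start;
        uint64_t len;
};

static size_t build_track_ranges(const struct hv_mapping *map, size_t n,
                                 struct track_range *out)
{
        size_t i;

        for (i = 0; i < n; i++) {
                /* Which address space is reported (IOVA vs. GPA/PA) is one
                 * of the points debated in this thread; pick accordingly. */
                out[i].start = map[i].iova;
                out[i].len   = map[i].len;
        }
        return n;
}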

> >
> > > Host should always have more resources than device, in that sense
> > > there could be several methods that tries to utilize host memory
> > > instead of the one in the device. I think we've discussed this when
> > > going through the doc prepared by Eugenio.
> > >
> > > >
> > > > > What happens if we're trying to migrate more than 1 device?
> > > > >
> > > > That is perfectly fine.
> > > > Each device is updating its log of pages it wrote.
> > > > The hypervisor is collecting their sum.
> > >
> > > See above.
> > >
> > > >
> > > > > >
> > > > > > > 3) DMA is part of the transport, it's natural to do logging
> > > > > > > there, why duplicate efforts in the virtio layer?
> > > > > > He he, you have funny comment.
> > > > > > When an abstract facility is added to virtio you say to do in transport.
> > > > >
> > > > > So it's not done in the general facility but tied to the admin part.
> > > > > And we all know dirty page tracking is a challenge and Eugenio
> > > > > has a good summary of pros/cons. A revisit of those docs make me
> > > > > think virtio is not the good place for doing that for may reasons:
> > > > >
> > > > > 1) as stated, platform will evolve to be able to tracking dirty
> > > > > pages, actually, it has been supported by a lot of major IOMMU
> > > > > vendors
> > > >
> > > > This is optional facility in virtio.
> > > > Can you please point to the references? I don’t see it in the
> > > > common Linux
> > > kernel support for it.
> > >
> > > Note that when IOMMUFD is being proposed, dirty page tracking is one
> > > of the major considerations.
> > >
> > > This is one recent proposal:
> > >
> > > https://www.spinics.net/lists/kvm/msg330894.html
> > >
> > Sure, so if platform supports it. it can be used from the platform.
> > If it does not, the device supplies it.
> >
> > > > Instead Linux kernel choose to extend to the devices.
> > >
> > > Well, as I stated, tracking dirty pages is challenging if you want
> > > to do it on a device, and you can't simply invent dirty page
> > > tracking for each type of the devices.
> > >
> > It is not invented.
> > It is generic framework for all virtio device types as proposed here.
> > Keep in mind, that it is optional already in v3 series.
> >
> > > > At least not seen to arrive this in any near term in start of 2024
> > > > which is
> > > where users must use this.
> > > >
> > > > > 2) you can't assume virtio is the only device that can be used
> > > > > by the guest, having dirty pages tracking to be implemented in
> > > > > each type of device is unrealistic
> > > > Of course, there is no such assumption made. Where did you see a
> > > > text that
> > > made such assumption?
> > >
> > > So what happens if you have a guest with virtio and other devices assigned?
> > >
> > What happens? Each device type would do its own dirty page tracking.
> > And if all devices does not have support, hypervisor knows to fall back to
> platform iommu or its own.
> >
> > > > Each virtio and non virtio devices who wants to report their dirty
> > > > page report,
> > > will do their way.
> > > >
> > > > > 3) inventing it in the virtio layer will be deprecated in the
> > > > > future for sure, as platform will provide much rich features for
> > > > > logging e.g it can do it per PASID etc, I don't see any reason
> > > > > virtio need to compete with the features that will be provided
> > > > > by the platform
> > > > Can you bring the cpu vendors and committement to virtio tc with
> > > > timelines
> > > so that virtio TC can omit?
> > >
> > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs
> > > to be built on top of transport or platform. There's no need to duplicate
> their job.
> > > Especially considering that virtio can't do better than them.
> > >
> > I wanted to see a strong commitment for the cpu vendors to support dirty
> page tracking.
> 
> The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and ARM
> are all supporting that now.
> 
> > And the work seems to have started for some platforms.
> 
> Let me quote from the above link:
> 
> """
> Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2 alongside
> VT-D rev3.x also do support.
> """
> 
> > Without such platform commitment, virtio also skipping it would not work.
> 
> Is the above sufficient? I'm a little bit more familiar with vtd, the hw feature has
> been there for years.
>
VT-d has a sticky D bit that requires synchronization with the IOPTE page caches when software wants to clear it.
Do you know if it is reliable when the device does multiple writes, i.e.,

a. iommu writes the D bit
b. software reads it
c. software synchronizes the cache
d. iommu writes the D bit again on the next write by the device

ARM SMMU based servers with D bit tracking are yet to be present.
It is still early to say the platform is ready.

It is optional, so whichever has the support will be used.
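
For context, a minimal sketch of the harvest sequence in steps a-d above,
with entirely hypothetical helpers and bit positions; a real IOMMU driver
would use its own PTE format, atomics and invalidation interface:

#include <stdint.h>
#include <stdbool.h>

#define IOPTE_DIRTY (1ull << 9)         /* hypothetical D-bit position */

/* Hypothetical IOTLB / paging-structure cache synchronization for one PTE. */
static void iotlb_sync(volatile uint64_t *iopte)
{
        (void)iopte;
}

/* Harvest one IO PTE: read the D bit, clear it, then synchronize so the
 * IOMMU does not keep using a cached copy with the stale D bit.  The open
 * question above is whether a device write landing between the clear and
 * the sync is guaranteed to set the D bit again (real code would also use
 * an atomic test-and-clear rather than this plain read/write). */
static bool harvest_dirty(volatile uint64_t *iopte)
{
        uint64_t pte = *iopte;          /* b. software reads it */

        if (!(pte & IOPTE_DIRTY))
                return false;           /* a. the device never wrote here */
        *iopte = pte & ~IOPTE_DIRTY;    /*    clear the sticky D bit */
        iotlb_sync(iopte);              /* c. synchronize the page caches */
        return true;                    /* d. the next device write must set it again */
}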
 
> >
> > > > i.e. in first year of 2024?
> > >
> > > Why does it matter in 2024?
> > Because users needs to use it now.
> >
> > >
> > > > If not, we are better off to offer this, and when/if platform
> > > > support is, sure,
> > > this feature can be disabled/not used/not enabled.
> > > >
> > > > > 4) if the platform support is missing, we can use software or
> > > > > leverage transport for assistance like PRI
> > > > All of these are in theory.
> > > > Our experiment shows PRI performance is 21x slower than page fault
> > > > rate
> > > done by the cpu.
> > > > It simply does not even pass a simple 10Gbps test.
> > >
> > > If you stick to the wire speed during migration, it can converge.
> > Do you have perf data for this?
> 
> No, but it's not hard to imagine the worst case. Wrote a small program that dirty
> every page by a NIC.
> 
> > In the internal tests we don’t see this happening.
> 
> downtime = dirty_rates * PAGE_SIZE / migration_speed
> 
> So if we get very high dirty rates (e.g by a high speed NIC), we can't satisfy the
> requirement of the downtime. Or if you see the converge, you might get help
> from the auto converge support by the hypervisors like KVM where it tries to
> throttle the VCPU then you can't reach the wire speed.
>
Once PRI is enabled, even without migration, there are basic perf issues.
 
> >
> > >
> > > > There is no requirement for mandating PRI either.
> > > > So it is unusable.
> > >
> > > It's not about mandating, it's about doing things in the correct
> > > layer. If PRI is slow, PCI can evolve for sure.
> > You should try.
> 
> Not my duty, I just want to make sure things are done in the correct layer, and
> once it needs to be done in the virtio, there's nothing obviously wrong.
> 
At present, it looks like not all platforms are equally ready for page tracking.

> > In the current state, it is mandating.
> > And if you think PRI is the only way,
> 
> I don't, it's just an example where virtio can leverage from either transport or
> platform. Or if it's the fault in virtio that slows down the PRI, then it is
> something we can do.
> 
Yea, it does not seem to be ready yet.

> >  than you should propose that in the dirty page tracking series that you listed
> above to not do dirty page tracking. Rather depend on PRI, right?
> 
> No, the point is to not duplicate works especially considering virtio can't do
> better than platform or transport.
> 
Both the platform and the virtio work are ongoing.

> >
> > >
> > > >
> > > > >
> > > > > > When one does something in transport, you say, this is
> > > > > > transport specific, do
> > > > > some generic.
> > > > > >
> > > > > > Here the device is being tracked is virtio device.
> > > > > > PCI-SIG has told already that PCIM interface is outside the scope of it.
> > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > >
> > > > > You will end up with a competition with the platform/transport
> > > > > one that will fail.
> > > > >
> > > > I don’t see a reason. There is no competition.
> > > > Platform always have a choice to not use device side page tracking
> > > > when it is
> > > supported.
> > >
> > > Platform provides a lot of other functionalities for dirty logging:
> > > e.g per PASID, granular, etc. So you want to duplicate them again in
> > > the virtio? If not, why choose this way?
> > >
> > It is optional for the platforms where platform do not have it.
> 
> We are developing new virtio functionalities that are targeted for future
> platforms. Otherwise we would end up with a feature with a very narrow use
> case.
In general I agree that the platform is an option too.
The hypervisor will be able to make the decision to use the platform when available and fall back to the device method when the platform does not have it.

For the future, and to be equally usable in the near term :)

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-08  4:28                       ` Jason Wang
  2023-11-08  8:17                         ` Michael S. Tsirkin
@ 2023-11-09  6:26                         ` Parav Pandit
  2023-11-15  7:59                           ` [virtio-comment] " Michael S. Tsirkin
  1 sibling, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-09  6:26 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: virtio-comment, cohuck, sburla, Shahaf Shuler, Maor Gottlieb,
	Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 8, 2023 9:59 AM
> 
> On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > Each virtio and non virtio devices who wants to report their
> > > > > > dirty page report,
> > > > > will do their way.
> > > > > >
> > > > > > > 3) inventing it in the virtio layer will be deprecated in
> > > > > > > the future for sure, as platform will provide much rich
> > > > > > > features for logging e.g it can do it per PASID etc, I don't
> > > > > > > see any reason virtio need to compete with the features that
> > > > > > > will be provided by the platform
> > > > > > Can you bring the cpu vendors and committement to virtio tc
> > > > > > with timelines
> > > > > so that virtio TC can omit?
> > > > >
> > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio
> > > > > needs to be built on top of transport or platform. There's no need to
> duplicate their job.
> > > > > Especially considering that virtio can't do better than them.
> > > > >
> > > > I wanted to see a strong commitment for the cpu vendors to support dirty
> page tracking.
> > >
> > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > ARM are all supporting that now.
> > >
> > > > And the work seems to have started for some platforms.
> > >
> > > Let me quote from the above link:
> > >
> > > """
> > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > alongside VT-D rev3.x also do support.
> > > """
> > >
> > > > Without such platform commitment, virtio also skipping it would not work.
> > >
> > > Is the above sufficient? I'm a little bit more familiar with vtd,
> > > the hw feature has been there for years.
> >
> >
> > Repeating myself - I'm not sure that will work well for all workloads.
> 
> I think this comment applies to this proposal as well.
> 
> > Definitely KVM did
> > not scan PTEs. It used pagefaults with bit per page and later as VM
> > size grew switched to PLM.  This interface is analogous to PLM,
> 
> I think you meant PML actually. And it doesn't work like PML. To behave like
> PML it needs to
> 
> 1) log buffers were organized as a queue with indices
> 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> 3) device need to send a notification to the driver if it runs out of the buffer
> 
> I don't see any of the above in this proposal. If we do that it would be less
> problematic than what is being proposed here.
> 
In this proposal, it is slightly different from PML.
The log buffer is a write record kept by the device; the device keeps recording writes.
And the owner driver queries the recorded pages.
Internally, the device can use PML or other implementations as it finds suitable.
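
As a rough illustration of that query flow, here is a minimal hypervisor-side sketch; the command wrapper and the record layout below are hypothetical placeholders, not the structures defined in this series:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical record layout: one written physical address range per entry. */
struct write_record {
    uint64_t phys_addr;
    uint64_t len;
};

/* Hypothetical wrapper around the device write records read command with
 * read-and-clear semantics: returns the number of records placed in recs[]
 * and clears those records on the device side. */
size_t read_write_records(int owner_fd, uint16_t member_id,
                          struct write_record *recs, size_t max);

void mark_range_dirty(uint64_t phys_addr, uint64_t len);   /* hypothetical */

/* Precopy loop: keep querying until the device has no more written pages. */
static void drain_write_records(int owner_fd, uint16_t member_id)
{
    struct write_record recs[256];
    size_t n;

    while ((n = read_write_records(owner_fd, member_id, recs, 256)) > 0) {
        for (size_t i = 0; i < n; i++)
            mark_range_dirty(recs[i].phys_addr, recs[i].len);
    }
}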

> Even if we manage to do that, it doesn't mean we won't have issues.
> 
> 1) For many reasons it can neither see nor log via GPA, so this requires a
> traversal of the vIOMMU mapping tables by the hypervisor afterwards, it would
> be expensive and need synchronization with the guest modification of the IO
> page table which looks very hard.
> 2) There are a lot of special or reserved IOVA ranges (for example the interrupt
> areas in x86) that need special care which is architectural and where it is
> beyond the scope or knowledge of the virtio device but the platform IOMMU.
> Things would be more complicated when SVA is enabled. And there could be
> other architecte specific knowledge (e.g
> PAGE_SIZE) that might be needed. There's no easy way to deal with those cases.
> 

Current and future iommufd and OS interfaces can likely support this already.
In the current proposal, multiple ranges are supplied to the device; the reserved ranges are not part of them.
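
As a rough illustration, a hypervisor could build that range list from the guest memory layout it already knows and leave out the reserved regions before starting write recording; the structures and helpers below are hypothetical, not the command layout of this series:

#include <stdint.h>
#include <stddef.h>

struct addr_range {
    uint64_t start;
    uint64_t len;
};

/* Hypothetical inputs known to the hypervisor: guest RAM layout and a
 * predicate for architecturally reserved regions (e.g. the x86 MSI window). */
size_t get_guest_ram_ranges(struct addr_range *out, size_t max);
int range_is_reserved(const struct addr_range *r);

/* Hypothetical wrapper around the start write recording command. */
int start_write_recording(int owner_fd, uint16_t member_id,
                          const struct addr_range *ranges, size_t count);

static int enable_device_write_recording(int owner_fd, uint16_t member_id)
{
    struct addr_range all[64], track[64];
    size_t n = get_guest_ram_ranges(all, 64);
    size_t m = 0;

    for (size_t i = 0; i < n; i++)
        if (!range_is_reserved(&all[i]))
            track[m++] = all[i];   /* reserved ranges are simply not supplied */

    return start_write_recording(owner_fd, member_id, track, m);
}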

> We wouldn't need to care about all of them if it is done at platform IOMMU
> level.
> 
I agree that when the platform IOMMU has support, and if it is better, the hypervisor should use it as the first priority.
Mainly because the D bit of the page is already there, rather than a special PML queue or a racy bitmap like what was proposed in the other series.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-08 17:16                             ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-09  6:27                               ` Parav Pandit
  0 siblings, 0 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-09  6:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Wednesday, November 8, 2023 10:46 PM
> 
> On Wed, Nov 08, 2023 at 09:00:00AM +0000, Parav Pandit wrote:
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Wednesday, November 8, 2023 1:47 PM
> >
> > > Only for TX, and I'm pretty sure they had the foresight to test RX
> > > not just TX but let's confirm. Parav did you test both directions?
> > Rx is the main part to test for the dirty page tracking.
> > Tx only is around ring updates, so not very interesting test.
> >
> > When page tracking is started, there is impact on Rx as natural throttle by
> slowing down rx packet rate.
> 
> Wait a sec sounds kind of like what we see with SVQ as well?
> 
> > And devices have implementation choice to covering the spectrum on amount
> of drop rate.
> 
> What does this mean practically?
How to implement dirty page tracking is an implementation-specific detail.
So, for example, what extra work the device does in its rx path to track pages, which data structures are used, etc., depends on the device implementation.
Practically, the device we are testing cannot do this at 800Gbps in the current generation, but possibly in the future.

> 
> > And has implementations evolve, it will improve as well.
> 
> Maybe, or maybe not.
> 
> > PRI for sure is out of question. Most workload won't use it at all given other
> limitations of virtio on PCI front and existing PRI behaviors.
> 
> Out of curiousity what do you refer to?
> 
The whole scheme of notifying page faults from the device to the IOMMU and serving them back is not as efficient or scalable as the way the CPU handles page faults.

> > There is also a test with platform IOMMU dirty page tracking done as well but
> it is very early stage to comment in this forum.
> 
> Interesting. That's a very important test I'd say - if platform based dirty page
> tracking is just as good we can avoid adding it to virtio.
I agree. I don't see that all platforms have it; there is ongoing work.
> 
> 
> And I'm rather confused at this point - I was under the impression you already
> have a prototype which shows negligeable performance impact with on device
> tracking.

Compared to no tracking by the platform, device tracking shows a significant gain.
Comparing platform tracking with device tracking is hard to say at this point.
Given that not all platforms may have it, the comparison is not very interesting from a perf point of view.

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-09  3:31                           ` Jason Wang
@ 2023-11-09  7:59                             ` Michael S. Tsirkin
  2023-11-10  6:46                               ` [virtio-comment] " Parav Pandit
  2023-11-13  3:31                               ` Jason Wang
  0 siblings, 2 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-09  7:59 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > > > > > will do their way.
> > > > > > > >
> > > > > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > > > > for sure, as platform will provide much rich features for logging
> > > > > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > > > > to compete with the features that will be provided by the platform
> > > > > > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > > > > > so that virtio TC can omit?
> > > > > > >
> > > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > > > > > on top of transport or platform. There's no need to duplicate their job.
> > > > > > > Especially considering that virtio can't do better than them.
> > > > > > >
> > > > > > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> > > > >
> > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > > > ARM are all supporting that now.
> > > > >
> > > > > > And the work seems to have started for some platforms.
> > > > >
> > > > > Let me quote from the above link:
> > > > >
> > > > > """
> > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > alongside VT-D rev3.x also do support.
> > > > > """
> > > > >
> > > > > > Without such platform commitment, virtio also skipping it would not work.
> > > > >
> > > > > Is the above sufficient? I'm a little bit more familiar with vtd, the
> > > > > hw feature has been there for years.
> > > >
> > > >
> > > > Repeating myself - I'm not sure that will work well for all workloads.
> > >
> > > I think this comment applies to this proposal as well.
> >
> > Yes - some systems might be better off with platform tracking.
> > And I think supporting shadow vq better would be nice too.
> 
> For shadow vq, did you mean the work that is done by Eugenio?

Yes.

> >
> > > > Definitely KVM did
> > > > not scan PTEs. It used pagefaults with bit per page and later as VM size
> > > > grew switched to PLM.  This interface is analogous to PLM,
> > >
> > > I think you meant PML actually. And it doesn't work like PML. To
> > > behave like PML it needs to
> > >
> > > 1) log buffers were organized as a queue with indices
> > > 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> > > 3) device need to send a notification to the driver if it runs out of the buffer
> > >
> > > I don't see any of the above in this proposal. If we do that it would
> > > be less problematic than what is being proposed here.
> >
> > What is common between this and PML is that you get the addresses
> > directly without scanning megabytes of bitmaps or worse -
> > hundreds of megabytes of page tables.
> 
> Yes, it has overhead but this is the method we use for vhost and KVM (earlier).
> 
> To me the  important advantage of PML is that it uses limited
> resources on the host which
> 
> 1) doesn't require resources in the device
> 2) doesn't scale as the guest memory increases. (but this advantage
> doesn't exist in neither this nor bitmap)

it seems 2 exactly exists here.


> >
> > The data structure is different but I don't see why it is critical.
> >
> > I agree that I don't see out of buffers notifications too which implies
> > device has to maintain something like a bitmap internally.  Which I
> > guess could be fine but it is not clear to me how large that bitmap has
> > to be. How does the device know? Needs to be addressed.
> 
> This is the question I asked Parav in another thread. Using host
> memory as a queue with notification (like PML) might be much better.

Well if queue is what you want to do you can just do it internally.
Problem of course is that it might overflow and cause things like
packet drops.


> >
> >
> > > Even if we manage to do that, it doesn't mean we won't have issues.
> > >
> > > 1) For many reasons it can neither see nor log via GPA, so this
> > > requires a traversal of the vIOMMU mapping tables by the hypervisor
> > > afterwards, it would be expensive and need synchronization with the
> > > guest modification of the IO page table which looks very hard.
> >
> > vIOMMU is fast enough to be used on data path but not fast enough for
> > dirty tracking?
> 
> We set up SPTEs or using nesting offloading where the PTEs could be
> iterated by hardware directly which is fast.

There's a way to have hardware find dirty PTEs for you quickly?
I don't know how it's done. Do tell.


> This is not the case here where software needs to iterate the IO page
> tables in the guest which could be slow.
> 
> > Hard to believe.  If true and you want to speed up
> > vIOMMU then you implement an efficient datastructure for that.
> 
> Besides the issue of performance, it's also racy, assuming we are logging IOVA.
> 
> 0) device log IOVA
> 1) hypervisor fetches IOVA from log buffer
> 2) guest map IOVA to a new GPA
> 3) hypervisor traverse guest table to get IOVA to new GPA
> 
> Then we lost the old GPA.

Interesting and a good point. And by the way e.g. vhost has the same
issue.  You need to flush dirty tracking info when changing the mappings
somehow.  Parav what's the plan for this? Should be addressed in the
spec too.



> >
> > > 2) There are a lot of special or reserved IOVA ranges (for example the
> > > interrupt areas in x86) that need special care which is architectural
> > > and where it is beyond the scope or knowledge of the virtio device but
> > > the platform IOMMU. Things would be more complicated when SVA is
> > > enabled.
> >
> > SVA being what here?
> 
> For example, IOMMU may treat interrupt ranges differently depending on
> whether SVA is enabled or not. It's very hard and unnecessary to teach
> devices about this.

Oh, shared virtual memory. So what are you saying here? Virtio
does not care; it just uses some addresses, and if you want it to,
it can record writes somewhere.

> >
> > > And there could be other architecte specific knowledge (e.g
> > > PAGE_SIZE) that might be needed. There's no easy way to deal with
> > > those cases.
> >
> > Good point about page size actually - using 4k unconditionally
> > is a waste of resources.
> 
> Actually, they are more than just PAGE_SIZE, for example, PASID and others.

what does pasid have to do with it? anyway, just give driver control
over page size.

> >
> >
> > > We wouldn't need to care about all of them if it is done at platform
> > > IOMMU level.
> >
> > If someone logs at IOMMU level then nothing needs to be done
> > in the spec at all. This is about capability at the device level.
> 
> True, but my question is where or not it can be done at the device level easily.

there's no "easily" about live migration ever.
For example on-device iommus are a thing.

> >
> >
> > > > what Lingshan
> > > > proposed is analogous to bit per page - problem unfortunately is
> > > > you can't easily set a bit by DMA.
> > > >
> > >
> > > I'm not saying bit/bytemap is the best, but it has been used by real
> > > hardware. And we have many other options.
> > >
> > > > So I think this dirty tracking is a good option to have.
> > > >
> > > >
> > > >
> > > > > >
> > > > > > > > i.e. in first year of 2024?
> > > > > > >
> > > > > > > Why does it matter in 2024?
> > > > > > Because users needs to use it now.
> > > > > >
> > > > > > >
> > > > > > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > >
> > > > > > > > > 4) if the platform support is missing, we can use software or
> > > > > > > > > leverage transport for assistance like PRI
> > > > > > > > All of these are in theory.
> > > > > > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > > > > > done by the cpu.
> > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > >
> > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > Do you have perf data for this?
> > > > >
> > > > > No, but it's not hard to imagine the worst case. Wrote a small program
> > > > > that dirty every page by a NIC.
> > > > >
> > > > > > In the internal tests we don’t see this happening.
> > > > >
> > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > >
> > > > > So if we get very high dirty rates (e.g by a high speed NIC), we can't
> > > > > satisfy the requirement of the downtime. Or if you see the converge,
> > > > > you might get help from the auto converge support by the hypervisors
> > > > > like KVM where it tries to throttle the VCPU then you can't reach the
> > > > > wire speed.
> > > >
> > > > Will only work for some device types.
> > > >
> > >
> > > Yes, that's the point. Parav said he doesn't see the issue, it's
> > > probably because he is testing a virtio-net and so the vCPU is
> > > automatically throttled. It doesn't mean it can work for other virito
> > > devices.
> >
> > Only for TX, and I'm pretty sure they had the foresight to test RX not
> > just TX but let's confirm. Parav did you test both directions?
> 
> RX speed somehow depends on the speed of refill, so throttling helps
> more or less.

It doesn't depend on the speed of refill; you just underrun and drop
packets. Then your nice 10usec latency becomes more like 10sec.

> >
> > > >
> > > >
> > > > > >
> > > > > > >
> > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > So it is unusable.
> > > > > > >
> > > > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > > > > slow, PCI can evolve for sure.
> > > > > > You should try.
> > > > >
> > > > > Not my duty, I just want to make sure things are done in the correct
> > > > > layer, and once it needs to be done in the virtio, there's nothing
> > > > > obviously wrong.
> > > >
> > > > Yea but just vague questions don't help to make sure eiter way.
> > >
> > > I don't think it's vague, I have explained, if something in the virito
> > > slows down the PRI, we can try to fix them.
> >
> > I don't believe you are going to make PRI fast. No one managed so far.
> 
> So it's the fault of PRI not virito, but it doesn't mean we need to do
> it in virtio.

I keep saying with this approach we would just say "e1000 emulation is
slow and encumbered this is the fault of e1000" and never get virtio at
all.  Assigning blame only gets you so far.

> >
> > > Missing functions in
> > > platform or transport is not a good excuse to try to workaround it in
> > > the virtio. It's a layer violation and we never had any feature like
> > > this in the past.
> >
> > Yes missing functionality in the platform is exactly why virtio
> > was born in the first place.
> 
> Well the platform can't do device specific logic. But that's not the
> case of dirty page tracking which is device logic agnostic.

Not true; platforms have had things like NICs on board for many
years. It's about performance, really. So I'd like Parav to publish some
experiment results and/or some estimates.


> >
> > > >
> > > > > > In the current state, it is mandating.
> > > > > > And if you think PRI is the only way,
> > > > >
> > > > > I don't, it's just an example where virtio can leverage from either
> > > > > transport or platform. Or if it's the fault in virtio that slows down
> > > > > the PRI, then it is something we can do.
> > > > >
> > > > > >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > >
> > > > > No, the point is to not duplicate works especially considering virtio
> > > > > can't do better than platform or transport.
> > > >
> > > > If someone says they tried and platform's migration support does not
> > > > work for them and they want to build a solution in virtio then
> > > > what exactly is the objection?
> > >
> > > The discussion is to make sure whether virtio can do this easily and
> > > correctly, then we can have a conclusion. I've stated some issues
> > > above, and I've asked other questions related to them which are still
> > > not answered.
> > >
> > > I think we had a very hard time in bypassing IOMMU in the past that we
> > > don't want to repeat.
> > >
> > > We've gone through several methods of logging dirty pages in the past
> > > (each with pros/cons), but this proposal never explains why it chooses
> > > one of them but not others. Spec needs to find the best path instead
> > > of just a possible path without any rationale about why.
> >
> > Adding more rationale isn't a bad thing.
> > In particular if platform supplies dirty tracking then how does
> > driver decide which to use platform or device capability?
> > A bit of discussion around this is a good idea.
> >
> >
> > > > virtio is here in the
> > > > first place because emulating devices didn't work well.
> > >
> > > I don't understand here. We have supported emulated devices for years.
> > > I'm pretty sure a lot of issues could be uncovered if this proposal
> > > can be prototyped with an emulated device first.
> > >
> > > Thanks
> >
> > virtio was originally PV as opposed to emulation. That there's now
> > hardware virtio and you call software implementation "an emulation" is
> > very meta.
> 
> Yes but I don't see how it relates to dirty page tracking. When we
> find a way it should work for both software and hardware devices.
> 
> Thanks

It has to work well on a variety of existing platforms. If it does then
sure, why would we roll our own.

-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-09  7:59                             ` Michael S. Tsirkin
@ 2023-11-10  6:46                               ` Parav Pandit
  2023-11-13  3:41                                 ` [virtio-comment] " Jason Wang
  2023-11-13  3:31                               ` Jason Wang
  1 sibling, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-10  6:46 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment, cohuck, sburla, Shahaf Shuler, Maor Gottlieb,
	Yishai Hadas, lingshan.zhu

Hi Michael,

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 9, 2023 1:29 PM

[..]
> > Besides the issue of performance, it's also racy, assuming we are logging
> IOVA.
> >
> > 0) device log IOVA
> > 1) hypervisor fetches IOVA from log buffer
> > 2) guest map IOVA to a new GPA
> > 3) hypervisor traverse guest table to get IOVA to new GPA
> >
> > Then we lost the old GPA.
> 
> Interesting and a good point. And by the way e.g. vhost has the same issue.  You
> need to flush dirty tracking info when changing the mappings somehow.  Parav
> what's the plan for this? Should be addressed in the spec too.
> 
As you listed, the flush is needed for vhost or device-based dirty page tracking.
The necessary plumbing for this is already covered by the query (read and clear) command of this v3 proposal.
It is listed as the Device Write Records Read command.

When the page write record is fully read, it is flushed.
How/when to use it is, I think, hypervisor specific, so we are probably better off not documenting those details.
Maybe such a read is also needed in some other path, depending on how the hypervisor is implemented; a rough sketch of one such flow is below.
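
For example, a minimal sketch of using that read-and-clear query around a vIOMMU mapping change; the function names are hypothetical and the real flow is hypervisor specific:

#include <stdint.h>

/* Hypothetical helpers: drain_write_records() issues the read-and-clear
 * query and logs the returned pages as dirty; map/unmap update the
 * emulated vIOMMU translation used by the passthrough device. */
void drain_write_records(int owner_fd, uint16_t member_id);
void unmap_iova(uint64_t iova, uint64_t len);
void map_iova(uint64_t iova, uint64_t gpa, uint64_t len);

static void viommu_remap(int owner_fd, uint16_t member_id,
                         uint64_t iova, uint64_t new_gpa, uint64_t len)
{
    /* Drain the device's write records while the old mapping is still in
     * place, so that writes done through it are not lost. */
    drain_write_records(owner_fd, member_id);

    /* Only then change the translation seen by the device. */
    unmap_iova(iova, len);
    map_iova(iova, new_gpa, len);
}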

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-09  7:59                             ` Michael S. Tsirkin
  2023-11-10  6:46                               ` [virtio-comment] " Parav Pandit
@ 2023-11-13  3:31                               ` Jason Wang
  2023-11-13  6:57                                 ` Michael S. Tsirkin
  2023-11-15 17:42                                 ` [virtio-comment] " Parav Pandit
  1 sibling, 2 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-13  3:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > > > > > > will do their way.
> > > > > > > > >
> > > > > > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > > > > > for sure, as platform will provide much rich features for logging
> > > > > > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > to compete with the features that will be provided by the platform
> > > > > > > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > > > > > > so that virtio TC can omit?
> > > > > > > >
> > > > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > > > > > > on top of transport or platform. There's no need to duplicate their job.
> > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > >
> > > > > > > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> > > > > >
> > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > > > > ARM are all supporting that now.
> > > > > >
> > > > > > > And the work seems to have started for some platforms.
> > > > > >
> > > > > > Let me quote from the above link:
> > > > > >
> > > > > > """
> > > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > > alongside VT-D rev3.x also do support.
> > > > > > """
> > > > > >
> > > > > > > Without such platform commitment, virtio also skipping it would not work.
> > > > > >
> > > > > > Is the above sufficient? I'm a little bit more familiar with vtd, the
> > > > > > hw feature has been there for years.
> > > > >
> > > > >
> > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > >
> > > > I think this comment applies to this proposal as well.
> > >
> > > Yes - some systems might be better off with platform tracking.
> > > And I think supporting shadow vq better would be nice too.
> >
> > For shadow vq, did you mean the work that is done by Eugenio?
>
> Yes.

That's exactly why vDPA starts with shadow virtqueue. We've evaluated
various possible approaches; each of them has its shortcomings, and
shadow virtqueue is the only one that doesn't require any additional
hardware features to work on every platform.

>
> > >
> > > > > Definitely KVM did
> > > > > not scan PTEs. It used pagefaults with bit per page and later as VM size
> > > > > grew switched to PLM.  This interface is analogous to PLM,
> > > >
> > > > I think you meant PML actually. And it doesn't work like PML. To
> > > > behave like PML it needs to
> > > >
> > > > 1) log buffers were organized as a queue with indices
> > > > 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> > > > 3) device need to send a notification to the driver if it runs out of the buffer
> > > >
> > > > I don't see any of the above in this proposal. If we do that it would
> > > > be less problematic than what is being proposed here.
> > >
> > > What is common between this and PML is that you get the addresses
> > > directly without scanning megabytes of bitmaps or worse -
> > > hundreds of megabytes of page tables.
> >
> > Yes, it has overhead but this is the method we use for vhost and KVM (earlier).
> >
> > To me the  important advantage of PML is that it uses limited
> > resources on the host which
> >
> > 1) doesn't require resources in the device
> > 2) doesn't scale as the guest memory increases. (but this advantage
> > doesn't exist in neither this nor bitmap)
>
> it seems 2 exactly exists here.

Actually not; Parav said in another thread that the device needs to
reserve sufficient resources.

>
>
> > >
> > > The data structure is different but I don't see why it is critical.
> > >
> > > I agree that I don't see out of buffers notifications too which implies
> > > device has to maintain something like a bitmap internally.  Which I
> > > guess could be fine but it is not clear to me how large that bitmap has
> > > to be. How does the device know? Needs to be addressed.
> >
> > This is the question I asked Parav in another thread. Using host
> > memory as a queue with notification (like PML) might be much better.
>
> Well if queue is what you want to do you can just do it internally.

Then it's not the proposal here; Parav has explained it in another
reply, and as explained, it lacks a lot of other facilities.

> Problem of course is that it might overflow and cause things like
> packet drops.

Exactly like PML. So sticking to wire speed should not be a general
goal in the context of migration. It can be done if the speed of the
migration interface is faster than the virtio device that needs to be
migrated.

>
>
> > >
> > >
> > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > >
> > > > 1) For many reasons it can neither see nor log via GPA, so this
> > > > requires a traversal of the vIOMMU mapping tables by the hypervisor
> > > > afterwards, it would be expensive and need synchronization with the
> > > > guest modification of the IO page table which looks very hard.
> > >
> > > vIOMMU is fast enough to be used on data path but not fast enough for
> > > dirty tracking?
> >
> > We set up SPTEs or using nesting offloading where the PTEs could be
> > iterated by hardware directly which is fast.
>
> There's a way to have hardware find dirty PTEs for you quickly?

Scanning PTEs on the host is faster and more secure than scanning
the guest's; that's what I want to say:

1) the guest page could be swapped out but not the host one.
2) there is no guest-triggerable behavior

> I don't know how it's done. Do tell.
>
>
> > This is not the case here where software needs to iterate the IO page
> > tables in the guest which could be slow.
> >
> > > Hard to believe.  If true and you want to speed up
> > > vIOMMU then you implement an efficient datastructure for that.
> >
> > Besides the issue of performance, it's also racy, assuming we are logging IOVA.
> >
> > 0) device log IOVA
> > 1) hypervisor fetches IOVA from log buffer
> > 2) guest map IOVA to a new GPA
> > 3) hypervisor traverse guest table to get IOVA to new GPA
> >
> > Then we lost the old GPA.
>
> Interesting and a good point.

Note that PML logs at GPA as it works at L1 of EPT.

> And by the way e.g. vhost has the same
> issue.  You need to flush dirty tracking info when changing the mappings
> somehow.

It's not the same:

1) memory translation is done by vhost
2) vhost knows the GPA and it doesn't log via IOVA.

See this for example, and DPDK has similar fixes.

commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
Author: Jason Wang <jasowang@redhat.com>
Date:   Wed Jan 16 16:54:42 2019 +0800

    vhost: log dirty page correctly

    Vhost dirty page logging API is designed to sync through GPA. But we
    try to log GIOVA when device IOTLB is enabled. This is wrong and may
    lead to missing data after migration.

    To solve this issue, when logging with device IOTLB enabled, we will:

    1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
       get HVA, for writable descriptor, get HVA through iovec. For used
       ring update, translate its GIOVA to HVA
    2) traverse the GPA->HVA mapping to get the possible GPA and log
       through GPA. Pay attention this reverse mapping is not guaranteed
       to be unique, so we should log each possible GPA in this case.

    This fix the failure of scp to guest during migration. In -next, we
    will probably support passing GIOVA->GPA instead of GIOVA->HVA.

    Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
    Reported-by: Jintack Lim <jintack@cs.columbia.edu>
    Cc: Jintack Lim <jintack@cs.columbia.edu>
    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Acked-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

All of the above is not what virtio is doing right now.
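
To illustrate that GPA-based logging step, here is a simplified sketch of the reverse mapping the commit message describes; the structures are placeholders, not the actual vhost code:

#include <stdint.h>
#include <stddef.h>

/* Simplified view of one guest memory region known to the host. */
struct mem_region {
    uint64_t gpa;    /* guest physical start */
    uint64_t hva;    /* host virtual start   */
    uint64_t size;
};

void log_gpa_dirty(uint64_t gpa, uint64_t len);    /* hypothetical logger */

/* Log a write done through a host virtual address by reverse mapping
 * HVA -> GPA.  The reverse mapping is not guaranteed to be unique, so
 * every matching GPA is logged, as the commit message explains. */
static void log_write_hva(const struct mem_region *regions, size_t count,
                          uint64_t hva, uint64_t len)
{
    for (size_t i = 0; i < count; i++) {
        if (hva >= regions[i].hva &&
            hva + len <= regions[i].hva + regions[i].size)
            log_gpa_dirty(regions[i].gpa + (hva - regions[i].hva), len);
    }
}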

> Parav what's the plan for this? Should be addressed in the
> spec too.
>

AFAIK, there's no easy/efficient way to do that. I hope I am wrong.

>
>
> > >
> > > > 2) There are a lot of special or reserved IOVA ranges (for example the
> > > > interrupt areas in x86) that need special care which is architectural
> > > > and where it is beyond the scope or knowledge of the virtio device but
> > > > the platform IOMMU. Things would be more complicated when SVA is
> > > > enabled.
> > >
> > > SVA being what here?
> >
> > For example, IOMMU may treat interrupt ranges differently depending on
> > whether SVA is enabled or not. It's very hard and unnecessary to teach
> > devices about this.
>
> Oh, shared virtual memory. So what are you saying here? Virtio
> does not care; it just uses some addresses, and if you want it to,
> it can record writes somewhere.

One example: PCI allows devices to send translated requests, so how can a
hypervisor know whether a logged address is a PA or an IOVA in this case? We
probably need a new bit. But it's not the only thing we need to deal with.

By definition, interrupt ranges and other reserved ranges should not
belong to dirty pages. And the logging should be done before the DMA,
where there's no way for the device to know whether or not an IOVA is
valid. It would be safer to just not report them from the
source instead of leaving it to the hypervisor to deal with, but this
seems impossible at the device level. Otherwise the hypervisor driver
needs to communicate with the (v)IOMMU to learn about the
interrupt (MSI) area, RMRR area, etc. in order to do the correct thing,
or it might have security implications. And those areas don't make
sense at L1 when vSVA is enabled. What's more, when the vIOMMU can be
fully offloaded, there's no easy way to fetch that information.

Again, it's hard to bypass or even duplicate the functionality of the
platform; otherwise we need to step into every single detail of a specific
transport, architecture or IOMMU to figure out whether or not logging
in virtio is correct, which is awkward and unrealistic. This proposal
suffers from an exactly similar issue when inventing things like
freeze/stop, where I've pointed out other branches of issues as well.

>
> > >
> > > > And there could be other architecte specific knowledge (e.g
> > > > PAGE_SIZE) that might be needed. There's no easy way to deal with
> > > > those cases.
> > >
> > > Good point about page size actually - using 4k unconditionally
> > > is a waste of resources.
> >
> > Actually, they are more than just PAGE_SIZE, for example, PASID and others.
>
> what does pasid have to do with it? anyway, just give driver control
> over page size.

For example, two virtqueues have two PASIDs assigned. How can a
hypervisor know which specific IOVA belongs to which PASID? For the
platform IOMMU this is handled naturally, as it talks to the transport. But I
don't think we need to duplicate every transport-specific address
space feature in the core virtio layer:

1) translated/untranslated request
2) request w/ and w/o PASID

>
> > >
> > >
> > > > We wouldn't need to care about all of them if it is done at platform
> > > > IOMMU level.
> > >
> > > If someone logs at IOMMU level then nothing needs to be done
> > > in the spec at all. This is about capability at the device level.
> >
> > True, but my question is where or not it can be done at the device level easily.
>
> there's no "easily" about live migration ever.

I think I've stated sufficient issues to demonstrate how hard it is
for virtio to do it. And I've given the link showing that it is possible to do
that in the IOMMU without those issues. So in this context, doing it in virtio
is much harder.

> For example on-device iommus are a thing.

I'm not sure that's the way to go considering the platform IOMMU
evolves very quickly.

>
> > >
> > >
> > > > > what Lingshan
> > > > > proposed is analogous to bit per page - problem unfortunately is
> > > > > you can't easily set a bit by DMA.
> > > > >
> > > >
> > > > I'm not saying bit/bytemap is the best, but it has been used by real
> > > > hardware. And we have many other options.
> > > >
> > > > > So I think this dirty tracking is a good option to have.
> > > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > > > i.e. in first year of 2024?
> > > > > > > >
> > > > > > > > Why does it matter in 2024?
> > > > > > > Because users needs to use it now.
> > > > > > >
> > > > > > > >
> > > > > > > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > >
> > > > > > > > > > 4) if the platform support is missing, we can use software or
> > > > > > > > > > leverage transport for assistance like PRI
> > > > > > > > > All of these are in theory.
> > > > > > > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > > > > > > done by the cpu.
> > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > >
> > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > Do you have perf data for this?
> > > > > >
> > > > > > No, but it's not hard to imagine the worst case. Wrote a small program
> > > > > > that dirty every page by a NIC.
> > > > > >
> > > > > > > In the internal tests we don’t see this happening.
> > > > > >
> > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > >
> > > > > > So if we get very high dirty rates (e.g by a high speed NIC), we can't
> > > > > > satisfy the requirement of the downtime. Or if you see the converge,
> > > > > > you might get help from the auto converge support by the hypervisors
> > > > > > like KVM where it tries to throttle the VCPU then you can't reach the
> > > > > > wire speed.
> > > > >
> > > > > Will only work for some device types.
> > > > >
> > > >
> > > > Yes, that's the point. Parav said he doesn't see the issue, it's
> > > > probably because he is testing a virtio-net and so the vCPU is
> > > > automatically throttled. It doesn't mean it can work for other virito
> > > > devices.
> > >
> > > Only for TX, and I'm pretty sure they had the foresight to test RX not
> > > just TX but let's confirm. Parav did you test both directions?
> >
> > RX speed somehow depends on the speed of refill, so throttling helps
> > more or less.
>
> It doesn't depend on the speed of refill; you just underrun and drop
> packets. Then your nice 10usec latency becomes more like 10sec.

I miss your point here. If the driver can't achieve wire speed without
dirty page tracking, it can't do so when dirty page tracking is
enabled either.

>
> > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > So it is unusable.
> > > > > > > >
> > > > > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > > > > > slow, PCI can evolve for sure.
> > > > > > > You should try.
> > > > > >
> > > > > > Not my duty, I just want to make sure things are done in the correct
> > > > > > layer, and once it needs to be done in the virtio, there's nothing
> > > > > > obviously wrong.
> > > > >
> > > > > Yea but just vague questions don't help to make sure eiter way.
> > > >
> > > > I don't think it's vague, I have explained, if something in the virito
> > > > slows down the PRI, we can try to fix them.
> > >
> > > I don't believe you are going to make PRI fast. No one managed so far.
> >
> > So it's the fault of PRI not virito, but it doesn't mean we need to do
> > it in virtio.
>
> I keep saying with this approach we would just say "e1000 emulation is
> slow and encumbered this is the fault of e1000" and never get virtio at
> all.  Assigning blame only gets you so far.

I think we are discussing different things. My point is that virtio needs
to leverage the functionality provided by the transport or platform
(especially considering they evolve faster than virtio). It seems to
me it's hard even to duplicate some basic functions of the platform IOMMU
in virtio.

>
> > >
> > > > Missing functions in
> > > > platform or transport is not a good excuse to try to workaround it in
> > > > the virtio. It's a layer violation and we never had any feature like
> > > > this in the past.
> > >
> > > Yes missing functionality in the platform is exactly why virtio
> > > was born in the first place.
> >
> > Well the platform can't do device specific logic. But that's not the
> > case of dirty page tracking which is device logic agnostic.
>
> Not true; platforms have had things like NICs on board for many
> years. It's about performance, really.

I've stated sufficient issues above. And one more obvious issue for
device-initiated page logging is that it needs a lot of extra or
unnecessary PCI transactions, which will throttle the performance of
the whole system (and lead to other issues like QoS). So I can't
believe it has good performance overall. Logging via the IOMMU or using
a shadow virtqueue doesn't need any extra PCI transactions, at least.

> So I'd like Parav to publish some
> experiment results and/or some estimates.
>

That's fine, but the above equation (used by QEMU) is sufficient to
demonstrate how hard it is to stick to wire speed in this case.
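
Just as a rough, illustrative calculation (assumed numbers, worst case where
every write lands on a distinct 4KiB page): a device receiving at 100Gbps can
dirty roughly 100e9 / 8 / 4096, about 3 million pages per second, which is
around 12.5GB/s of newly dirtied memory. If the migration link also moves
about 12.5GB/s, precopy cannot converge without additionally slowing the
device down.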

>
> > >
> > > > >
> > > > > > > In the current state, it is mandating.
> > > > > > > And if you think PRI is the only way,
> > > > > >
> > > > > > I don't, it's just an example where virtio can leverage from either
> > > > > > transport or platform. Or if it's the fault in virtio that slows down
> > > > > > the PRI, then it is something we can do.
> > > > > >
> > > > > > >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > >
> > > > > > No, the point is to not duplicate works especially considering virtio
> > > > > > can't do better than platform or transport.
> > > > >
> > > > > If someone says they tried and platform's migration support does not
> > > > > work for them and they want to build a solution in virtio then
> > > > > what exactly is the objection?
> > > >
> > > > The discussion is to make sure whether virtio can do this easily and
> > > > correctly, then we can have a conclusion. I've stated some issues
> > > > above, and I've asked other questions related to them which are still
> > > > not answered.
> > > >
> > > > I think we had a very hard time in bypassing IOMMU in the past that we
> > > > don't want to repeat.
> > > >
> > > > We've gone through several methods of logging dirty pages in the past
> > > > (each with pros/cons), but this proposal never explains why it chooses
> > > > one of them but not others. Spec needs to find the best path instead
> > > > of just a possible path without any rationale about why.
> > >
> > > Adding more rationale isn't a bad thing.
> > > In particular if platform supplies dirty tracking then how does
> > > driver decide which to use platform or device capability?
> > > A bit of discussion around this is a good idea.
> > >
> > >
> > > > > virtio is here in the
> > > > > first place because emulating devices didn't work well.
> > > >
> > > > I don't understand here. We have supported emulated devices for years.
> > > > I'm pretty sure a lot of issues could be uncovered if this proposal
> > > > can be prototyped with an emulated device first.
> > > >
> > > > Thanks
> > >
> > > virtio was originally PV as opposed to emulation. That there's now
> > > hardware virtio and you call software implementation "an emulation" is
> > > very meta.
> >
> > Yes but I don't see how it relates to dirty page tracking. When we
> > find a way it should work for both software and hardware devices.
> >
> > Thanks
>
> It has to work well on a variety of existing platforms. If it does then
> sure, why would we roll our own.

If virtio can do that in an efficient way without any issues, I agree.
But it seems not.

Thanks

>
> --
> MST
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-09  6:24                     ` Parav Pandit
@ 2023-11-13  3:37                       ` Jason Wang
  2023-11-15 17:38                         ` [virtio-comment] " Parav Pandit
  2023-11-15  7:58                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-13  3:37 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 7, 2023 9:34 AM
> >
> > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, November 6, 2023 12:04 PM
> > > >
> > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > >
> > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > >
> > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > >
> > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > During a device migration flow (typically in a precopy
> > > > > > > > > > > phase of the live migration), a device may write to
> > > > > > > > > > > the guest memory. Some iommu/hypervisor may not be
> > > > > > > > > > > able to track these
> > > > > > written pages.
> > > > > > > > > > > These pages to be migrated from source to destination
> > hypervisor.
> > > > > > > > > > >
> > > > > > > > > > > A device which writes to these pages, provides the
> > > > > > > > > > > page address record of the to the owner device. The
> > > > > > > > > > > owner device starts write recording for the device and
> > > > > > > > > > > queries all the page addresses written by the device.
> > > > > > > > > > >
> > > > > > > > > > > Fixes:
> > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > > > > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > > > > > > > > > ---
> > > > > > > > > > > changelog:
> > > > > > > > > > > v1->v2:
> > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > ---
> > > > > > > > > > >  admin-cmds-device-migration.tex | 15 +++++++++++++++
> > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > > > > > b/admin-cmds-device-migration.tex index
> > > > > > > > > > > ed911e4..2e32f2c
> > > > > > > > > > > 100644
> > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > > > Migration}\label{sec:Basic Facilities of a Virtio
> > > > > > > > > > > Device / The owner driver can discard any partially
> > > > > > > > > > > read or written device context when  any of the device
> > > > > > > > > > > migration flow
> > > > > > > > > > should be aborted.
> > > > > > > > > > >
> > > > > > > > > > > +During the device migration flow, a passthrough
> > > > > > > > > > > +device may write data to the guest virtual machine's
> > > > > > > > > > > +memory, a source hypervisor needs to keep track of
> > > > > > > > > > > +these written memory to migrate such memory to
> > > > > > > > > > > +destination
> > > > > > > > > > hypervisor.
> > > > > > > > > > > +Some systems may not be able to keep track of such
> > > > > > > > > > > +memory write addresses at hypervisor level. In such a
> > > > > > > > > > > +scenario, a device records and reports these written
> > > > > > > > > > > +memory addresses to the owner device. The owner
> > > > > > > > > > > +driver enables write recording for one or more
> > > > > > > > > > > +physical address ranges per device during device
> > > > > > migration flow.
> > > > > > > > > > > +The owner driver periodically queries these written
> > > > > > > > > > > +physical address
> > > > > > > > records from the device.
> > > > > > > > > >
> > > > > > > > > > I wonder how PA works in this case. Device uses
> > > > > > > > > > untranslated requests so it can only see IOVA. We can't mandate
> > ATS anyhow.
> > > > > > > > > Michael suggested to keep the language uniform as PA as
> > > > > > > > > this is ultimately
> > > > > > > > what the guest driver is supplying during vq creation and in
> > > > > > > > posting buffers as physical address.
> > > > > > > >
> > > > > > > > This seems to need some work. And, can you show me how it can
> > work?
> > > > > > > >
> > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor expected to do a
> > > > > > > > bisection of the whole range?
> > > > > > > > 2) does the device need to reserve sufficient internal
> > > > > > > > resources for logging the dirty page and why (not)?
> > > > > > > No when dirty page logging starts, only at that time, device
> > > > > > > will reserve
> > > > > > enough resources.
> > > > > >
> > > > > > GAW is 48bit, how large would it have then?
> > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > It is function of address ranges for the amount of guest memory
> > > > > regardless of
> > > > GAW.
> > > >
> > > > The problem is, e.g when vIOMMU is enabled, you can't know which
> > > > IOVA is actually used by guests. And even for the case when vIOMMU
> > > > is not enabled, the guest may have several TBs. Is it easy to
> > > > reserve sufficient resources by the device itself?
> > > >
> > > When page tracking is enabled per device, it knows about the range and it can
> > reserve certain resource.
> >
> > I didn't see such an interface in this series. Anything I miss?
> >
> Yes, this patch and the next patch is covering the page tracking start,stop and query commands.
> They are named as write recording commands.

So I still don't see how the device can reserve sufficient resources.
Guests may map a very large area of memory to the IOMMU (or, when the vIOMMU is
disabled, GPA is used). It could be several TBs; how can the device
reserve sufficient resources in this case? Again, if we use host
resources, we don't need to care about this.

>
> > Btw, the IOVA is allocated by the guest actually, how can we know the range?
> > (or using the host range?)
> >
> Hypervisor would have mapping translation.

That's really tricky and can only work in some cases:

1) It requires the hypervisor to traverse the guest I/O page tables,
which could cover a very large range.
2) It requires the hypervisor to trap modifications of the guest I/O
page tables and synchronize with the range changes, which is
inefficient and can only be done when we are doing shadow PTEs. It
won't work when the nested translation is offloaded to the
hardware.
3) It is racy with the guest modification of the I/O page tables, which is
explained in another thread.
4) It is not aware of new features like PASID, which has been explained in
another thread.

>
> > >
> > > > Host should always have more resources than device, in that sense
> > > > there could be several methods that tries to utilize host memory
> > > > instead of the one in the device. I think we've discussed this when
> > > > going through the doc prepared by Eugenio.
> > > >
> > > > >
> > > > > > What happens if we're trying to migrate more than 1 device?
> > > > > >
> > > > > That is perfectly fine.
> > > > > Each device is updating its log of pages it wrote.
> > > > > The hypervisor is collecting their sum.
> > > >
> > > > See above.
> > > >
> > > > >
> > > > > > >
> > > > > > > > 3) DMA is part of the transport, it's natural to do logging
> > > > > > > > there, why duplicate efforts in the virtio layer?
> > > > > > > He he, you have funny comment.
> > > > > > > When an abstract facility is added to virtio you say to do in transport.
> > > > > >
> > > > > > So it's not done in the general facility but tied to the admin part.
> > > > > > And we all know dirty page tracking is a challenge and Eugenio
> > > > > > has a good summary of pros/cons. A revisit of those docs make me
> > > > > > think virtio is not the good place for doing that for may reasons:
> > > > > >
> > > > > > 1) as stated, platform will evolve to be able to tracking dirty
> > > > > > pages, actually, it has been supported by a lot of major IOMMU
> > > > > > vendors
> > > > >
> > > > > This is optional facility in virtio.
> > > > > Can you please point to the references? I don’t see it in the
> > > > > common Linux
> > > > kernel support for it.
> > > >
> > > > Note that when IOMMUFD is being proposed, dirty page tracking is one
> > > > of the major considerations.
> > > >
> > > > This is one recent proposal:
> > > >
> > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > >
> > > Sure, so if platform supports it. it can be used from the platform.
> > > If it does not, the device supplies it.
> > >
> > > > > Instead Linux kernel choose to extend to the devices.
> > > >
> > > > Well, as I stated, tracking dirty pages is challenging if you want
> > > > to do it on a device, and you can't simply invent dirty page
> > > > tracking for each type of the devices.
> > > >
> > > It is not invented.
> > > It is generic framework for all virtio device types as proposed here.
> > > Keep in mind, that it is optional already in v3 series.
> > >
> > > > > At least not seen to arrive this in any near term in start of 2024
> > > > > which is
> > > > where users must use this.
> > > > >
> > > > > > 2) you can't assume virtio is the only device that can be used
> > > > > > by the guest, having dirty pages tracking to be implemented in
> > > > > > each type of device is unrealistic
> > > > > Of course, there is no such assumption made. Where did you see a
> > > > > text that
> > > > made such assumption?
> > > >
> > > > So what happens if you have a guest with virtio and other devices assigned?
> > > >
> > > What happens? Each device type would do its own dirty page tracking.
> > > And if all devices does not have support, hypervisor knows to fall back to
> > platform iommu or its own.
> > >
> > > > > Each virtio and non virtio devices who wants to report their dirty
> > > > > page report,
> > > > will do their way.
> > > > >
> > > > > > 3) inventing it in the virtio layer will be deprecated in the
> > > > > > future for sure, as platform will provide much rich features for
> > > > > > logging e.g it can do it per PASID etc, I don't see any reason
> > > > > > virtio need to compete with the features that will be provided
> > > > > > by the platform
> > > > > Can you bring the cpu vendors and committement to virtio tc with
> > > > > timelines
> > > > so that virtio TC can omit?
> > > >
> > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs
> > > > to be built on top of transport or platform. There's no need to duplicate
> > their job.
> > > > Especially considering that virtio can't do better than them.
> > > >
> > > I wanted to see a strong commitment for the cpu vendors to support dirty
> > page tracking.
> >
> > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and ARM
> > are all supporting that now.
> >
> > > And the work seems to have started for some platforms.
> >
> > Let me quote from the above link:
> >
> > """
> > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2 alongside
> > VT-D rev3.x also do support.
> > """
> >
> > > Without such platform commitment, virtio also skipping it would not work.
> >
> > Is the above sufficient? I'm a little bit more familiar with vtd, the hw feature has
> > been there for years.
> >
> VT-d has a sticky D bit that requires synchronization with the IOPTE page caches when software wants to clear it.

This is by design.

> Do you know if it is reliable when the device does multiple writes, i.e.,
>
> a. IOMMU writes the D bit
> b. software reads it
> c. software synchronizes the cache
> d. IOMMU writes the D bit again on the next write by the device

What issue did you see here? But that's not even an excuse: if there's
a bug, let's report it to the IOMMU vendors and let them fix it. The
thread I pointed you to is actually a good place for that.

Again, the point is to let the correct role play.

>
> ARM SMMU based servers with D bit tracking are yet to appear.
> It is still early to say the platform is ready.

That is not what I read from either the series I posted or the spec;
the dirty bit has been supported for several years, at least for VT-d.

>
> It is optional so whichever has the support it will be used.

I can't see the point of this; it is already available. And migration
doesn't exist in the virtio spec yet.

>
> > >
> > > > > i.e. in first year of 2024?
> > > >
> > > > Why does it matter in 2024?
> > > Because users needs to use it now.
> > >
> > > >
> > > > > If not, we are better off to offer this, and when/if platform
> > > > > support is, sure,
> > > > this feature can be disabled/not used/not enabled.
> > > > >
> > > > > > 4) if the platform support is missing, we can use software or
> > > > > > leverage transport for assistance like PRI
> > > > > All of these are in theory.
> > > > > Our experiment shows PRI performance is 21x slower than page fault
> > > > > rate
> > > > done by the cpu.
> > > > > It simply does not even pass a simple 10Gbps test.
> > > >
> > > > If you stick to the wire speed during migration, it can converge.
> > > Do you have perf data for this?
> >
> > No, but it's not hard to imagine the worst case. Wrote a small program that dirty
> > every page by a NIC.
> >
> > > In the internal tests we don’t see this happening.
> >
> > downtime = dirty_rates * PAGE_SIZE / migration_speed
> >
> > So if we get very high dirty rates (e.g by a high speed NIC), we can't satisfy the
> > requirement of the downtime. Or if you see the converge, you might get help
> > from the auto converge support by the hypervisors like KVM where it tries to
> > throttle the VCPU then you can't reach the wire speed.
> >
> Once PRI is enabled, even without migration, there are basic perf issues.

The context is not PRI here...

It's about whether you can stick to wire speed during live migration.
Based on the analysis so far, you can't achieve wire speed and the
downtime target at the same time. That's why the hypervisor needs to
throttle the vCPUs or the devices.
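
As a rough illustration of the convergence argument using the downtime
equation quoted above (the dirty rate and link speed are assumed
numbers, not measurements):

    #include <stdio.h>

    int main(void)
    {
            double page_size = 4096.0;              /* bytes per tracked page */
            double dirty_rate = 2.5e6;              /* pages/s dirtied by the device (assumed) */
            double migration_speed = 100e9 / 8.0;   /* assumed 100 Gbps migration link, bytes/s */

            double dirty_bw = dirty_rate * page_size;   /* bytes/s of newly dirtied memory */
            double ratio = dirty_bw / migration_speed;  /* dirty_rate * PAGE_SIZE / migration_speed */

            printf("dirty bandwidth %.1f GB/s, ratio %.2f\n", dirty_bw / 1e9, ratio);
            /* ratio < 1: pre-copy can converge, slowly; ratio >= 1: it cannot
             * converge without throttling the vCPUs or the device.
             */
            return 0;
    }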

For PRI, it really depends on how you want to use it. E.g. if you don't
want to pin a page, the performance is the price you must pay.

>
> > >
> > > >
> > > > > There is no requirement for mandating PRI either.
> > > > > So it is unusable.
> > > >
> > > > It's not about mandating, it's about doing things in the correct
> > > > layer. If PRI is slow, PCI can evolve for sure.
> > > You should try.
> >
> > Not my duty, I just want to make sure things are done in the correct layer, and
> > once it needs to be done in the virtio, there's nothing obviously wrong.
> >
> At present, it looks like not all platforms are equally ready for page tracking.

That's not an excuse for virtio to support it. And we also need to
figure out whether virtio can do it easily. I've pointed out sufficient
issues, and I'm pretty sure there would be more as the platform evolves.

>
> > > In the current state, it is mandating.
> > > And if you think PRI is the only way,
> >
> > I don't, it's just an example where virtio can leverage from either transport or
> > platform. Or if it's the fault in virtio that slows down the PRI, then it is
> > something we can do.
> >
> Yea, it does not seem to be ready yet.
>
> > >  than you should propose that in the dirty page tracking series that you listed
> > above to not do dirty page tracking. Rather depend on PRI, right?
> >
> > No, the point is to not duplicate works especially considering virtio can't do
> > better than platform or transport.
> >
> Both the platform and virtio work is ongoing.

Why duplicate the work then?

>
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > When one does something in transport, you say, this is
> > > > > > > transport specific, do
> > > > > > some generic.
> > > > > > >
> > > > > > > Here the device is being tracked is virtio device.
> > > > > > > PCI-SIG has told already that PCIM interface is outside the scope of it.
> > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > >
> > > > > > You will end up with a competition with the platform/transport
> > > > > > one that will fail.
> > > > > >
> > > > > I don’t see a reason. There is no competition.
> > > > > Platform always have a choice to not use device side page tracking
> > > > > when it is
> > > > supported.
> > > >
> > > > Platform provides a lot of other functionalities for dirty logging:
> > > > e.g per PASID, granular, etc. So you want to duplicate them again in
> > > > the virtio? If not, why choose this way?
> > > >
> > > It is optional for the platforms where platform do not have it.
> >
> > We are developing new virtio functionalities that are targeted for future
> > platforms. Otherwise we would end up with a feature with a very narrow use
> > case.
> In general I agree that the platform is an option too.
> The hypervisor will be able to decide to use the platform when available and fall back to the device method when the platform does not have it.
>
> Future and to be equally usable in near term :)

Please don't apply a double standard again:

When you are talking about TDISP, you want virtio to be designed to
fit a future where the platform will be ready.
When you are talking about dirty tracking, you want it to work now even though

1) most of the platforms are ready now
2) whether or not virtio can log dirty pages correctly is still questionable

Thanks



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-10  6:46                               ` [virtio-comment] " Parav Pandit
@ 2023-11-13  3:41                                 ` Jason Wang
  2023-11-13 14:30                                   ` Michael S. Tsirkin
  2023-11-15 17:37                                   ` [virtio-comment] " Parav Pandit
  0 siblings, 2 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-13  3:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com> wrote:
>
> Hi Michael,
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 9, 2023 1:29 PM
>
> [..]
> > > Besides the issue of performance, it's also racy, assuming we are logging
> > IOVA.
> > >
> > > 0) device log IOVA
> > > 1) hypervisor fetches IOVA from log buffer
> > > 2) guest map IOVA to a new GPA
> > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > >
> > > Then we lost the old GPA.
> >
> > Interesting and a good point. And by the way e.g. vhost has the same issue.  You
> > need to flush dirty tracking info when changing the mappings somehow.  Parav
> > what's the plan for this? Should be addressed in the spec too.
> >
> As you listed the flush is needed for vhost or device-based DPT.

What does DPT mean? Device Page Table? Please let's not invent
terminology which is not known by others.

We have discussed it many times. You can't just depend on ATS or
reinvent the wheel in virtio.

What's more, please try not to give me the impression that the
proposal is optimized for a specific vendor (like device IOMMU
features).

> The necessary plumbing is already covered for this in the query (read and clear) command of this v3 proposal.

The issue is logging via IOVA ... I don't see how "read and clear" can help.

> It is listed in Device Write Records Read Command.

Please explain how your proposal can solve the above race.

>
> When the page write record is fully read, it is flushed.
> How/when to use it is, I think, hypervisor specific, so we are probably better off not documenting those details.

Well, as the author of this proposal, at least you need to know how a
hypervisor can work with your proposal, no?

> Maybe such a read is needed in some other path too, depending on how the hypervisor is implemented.

What do you mean by "Maybe ... some other path" here? You're
inventing a mechanism without knowing how a hypervisor can use it?

Thanks



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-13  3:31                               ` Jason Wang
@ 2023-11-13  6:57                                 ` Michael S. Tsirkin
  2023-11-14  7:34                                   ` Zhu, Lingshan
  2023-11-14  7:57                                   ` Jason Wang
  2023-11-15 17:42                                 ` [virtio-comment] " Parav Pandit
  1 sibling, 2 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-13  6:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Mon, Nov 13, 2023 at 11:31:37AM +0800, Jason Wang wrote:
> On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > > > > > > > will do their way.
> > > > > > > > > >
> > > > > > > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > > > > > > for sure, as platform will provide much rich features for logging
> > > > > > > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > to compete with the features that will be provided by the platform
> > > > > > > > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > > > > > > > so that virtio TC can omit?
> > > > > > > > >
> > > > > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > > > > > > > on top of transport or platform. There's no need to duplicate their job.
> > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > >
> > > > > > > > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> > > > > > >
> > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > > > > > ARM are all supporting that now.
> > > > > > >
> > > > > > > > And the work seems to have started for some platforms.
> > > > > > >
> > > > > > > Let me quote from the above link:
> > > > > > >
> > > > > > > """
> > > > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > > > alongside VT-D rev3.x also do support.
> > > > > > > """
> > > > > > >
> > > > > > > > Without such platform commitment, virtio also skipping it would not work.
> > > > > > >
> > > > > > > Is the above sufficient? I'm a little bit more familiar with vtd, the
> > > > > > > hw feature has been there for years.
> > > > > >
> > > > > >
> > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > >
> > > > > I think this comment applies to this proposal as well.
> > > >
> > > > Yes - some systems might be better off with platform tracking.
> > > > And I think supporting shadow vq better would be nice too.
> > >
> > > For shadow vq, did you mean the work that is done by Eugenio?
> >
> > Yes.
> 
> That's exactly why vDPA starts with shadow virtqueue. We've evaluated
> various possible approaches, each of them have their shortcomings and
> shadow virtqueue is the only one that doesn't require any additional
> hardware features to work in every platform.

What I would like to see is an effort to switch shadow on/off, not keep
it on at all times. That's only good enough for a PoC. And to work on
top of virtio, that will require effort in the spec.  If I see spec
patches that do that, I personally would support that.  It needs to be
reasonably generic though; a single 16-bit RW number is not going to be
enough. I think admin commands are likely a good interface for this. If
it's a hack making vendor-specific assumptions, just keep it in vDPA.
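
Purely as an illustration of "more than a 16-bit RW number": a
hypothetical admin-command payload for switching a shadowed virtqueue
on/off might need to carry at least something like the following; none
of these field names or sizes come from the spec or from this series:

    #include <stdint.h>

    struct virtio_admin_shadow_vq_ctrl {        /* hypothetical layout */
            uint16_t vq_index;                  /* which virtqueue to shadow */
            uint16_t flags;                     /* e.g. bit 0: enable, bit 1: flush first */
            uint32_t reserved;
            uint64_t shadow_area_addr;          /* host buffer the traffic is redirected to */
            uint32_t shadow_area_len;
            uint32_t padding;
    };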

> >
> > > >
> > > > > > Definitely KVM did
> > > > > > not scan PTEs. It used pagefaults with bit per page and later as VM size
> > > > > > grew switched to PLM.  This interface is analogous to PLM,
> > > > >
> > > > > I think you meant PML actually. And it doesn't work like PML. To
> > > > > behave like PML it needs to
> > > > >
> > > > > 1) log buffers were organized as a queue with indices
> > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> > > > > 3) device need to send a notification to the driver if it runs out of the buffer
> > > > >
> > > > > I don't see any of the above in this proposal. If we do that it would
> > > > > be less problematic than what is being proposed here.
> > > >
> > > > What is common between this and PML is that you get the addresses
> > > > directly without scanning megabytes of bitmaps or worse -
> > > > hundreds of megabytes of page tables.
> > >
> > > Yes, it has overhead but this is the method we use for vhost and KVM (earlier).
> > >
> > > To me the  important advantage of PML is that it uses limited
> > > resources on the host which
> > >
> > > 1) doesn't require resources in the device
> > > 2) doesn't scale as the guest memory increases. (but this advantage
> > > doesn't exist in neither this nor bitmap)
> >
> > it seems 2 exactly exists here.
> 
> Actually not, Parav said the device needs to reserve sufficient
> resources in another thread.
> 
> >
> >
> > > >
> > > > The data structure is different but I don't see why it is critical.
> > > >
> > > > I agree that I don't see out of buffers notifications too which implies
> > > > device has to maintain something like a bitmap internally.  Which I
> > > > guess could be fine but it is not clear to me how large that bitmap has
> > > > to be. How does the device know? Needs to be addressed.
> > >
> > > This is the question I asked Parav in another thread. Using host
> > > memory as a queue with notification (like PML) might be much better.
> >
> > Well if queue is what you want to do you can just do it internally.
> 
> Then it's not the proposal here, Parav has explained it in another
> reply, and as explained it lacks a lot of other facilities.
> 
> > Problem of course is that it might overflow and cause things like
> > packet drops.
> 
> Exactly like PML. So sticking to wire speed should not be a general
> goal in the context of migration. It can be done if the speed of the
> migration interface is faster than the virtio device that needs to be
> migrated.

People buy hardware to improve performance. Apparently there are people
who want to build this hardware. It is not our role to tell either
of the groups "this should not be a general goal". 


> >
> >
> > > >
> > > >
> > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > >
> > > > > 1) For many reasons it can neither see nor log via GPA, so this
> > > > > requires a traversal of the vIOMMU mapping tables by the hypervisor
> > > > > afterwards, it would be expensive and need synchronization with the
> > > > > guest modification of the IO page table which looks very hard.
> > > >
> > > > vIOMMU is fast enough to be used on data path but not fast enough for
> > > > dirty tracking?
> > >
> > > We set up SPTEs or using nesting offloading where the PTEs could be
> > > iterated by hardware directly which is fast.
> >
> > There's a way to have hardware find dirty PTEs for you quickly?
> 
> Scanning PTEs on the host is faster and more secure than scanning
> guests, that's what I want to say:
> 
> 1) the guest page could be swapped out but not the host one.
> 2) no guest triggerable behavior
> 
> > I don't know how it's done. Do tell.
> >
> >
> > > This is not the case here where software needs to iterate the IO page
> > > tables in the guest which could be slow.
> > >
> > > > Hard to believe.  If true and you want to speed up
> > > > vIOMMU then you implement an efficient datastructure for that.
> > >
> > > Besides the issue of performance, it's also racy, assuming we are logging IOVA.
> > >
> > > 0) device log IOVA
> > > 1) hypervisor fetches IOVA from log buffer
> > > 2) guest map IOVA to a new GPA
> > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > >
> > > Then we lost the old GPA.
> >
> > Interesting and a good point.
> 
> Note that PML logs at GPA as it works at L1 of EPT.

And that's perfect for migration.

> > And by the way e.g. vhost has the same
> > issue.  You need to flush dirty tracking info when changing the mappings
> > somehow.
> 
> It's not,
> 
> 1) memory translation is done by vhost
> 2) vhost knows GPA and it doesn't log via IOVA.
> 
> See this for example, and DPDK has similar fixes.
> 
> commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Wed Jan 16 16:54:42 2019 +0800
> 
>     vhost: log dirty page correctly
> 
>     Vhost dirty page logging API is designed to sync through GPA. But we
>     try to log GIOVA when device IOTLB is enabled. This is wrong and may
>     lead to missing data after migration.
> 
>     To solve this issue, when logging with device IOTLB enabled, we will:
> 
>     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
>        get HVA, for writable descriptor, get HVA through iovec. For used
>        ring update, translate its GIOVA to HVA
>     2) traverse the GPA->HVA mapping to get the possible GPA and log
>        through GPA. Pay attention this reverse mapping is not guaranteed
>        to be unique, so we should log each possible GPA in this case.
> 
>     This fix the failure of scp to guest during migration. In -next, we
>     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> 
>     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
>     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
>     Cc: Jintack Lim <jintack@cs.columbia.edu>
>     Signed-off-by: Jason Wang <jasowang@redhat.com>
>     Acked-by: Michael S. Tsirkin <mst@redhat.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> All of the above is not what virtio did right now.
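
For reference, a compressed sketch of the logging scheme the quoted
commit describes; this is not the vhost code, and the helper names are
illustrative only:

    #include <stdint.h>
    #include <stddef.h>

    struct mem_region { uint64_t gpa, hva, len; };  /* one GPA->HVA memory table entry */

    static void log_gpa(uint64_t gpa)
    {
            (void)gpa;      /* would set this page's bit in the migration log */
    }

    /* The write lands at an HVA (obtained from the GIOVA->HVA IOTLB result);
     * every GPA that maps to that HVA is logged, because the reverse GPA->HVA
     * mapping is not guaranteed to be unique.
     */
    static void log_write(const struct mem_region *mem, size_t n,
                          uint64_t hva, uint64_t len)
    {
            for (size_t i = 0; i < n; i++)
                    if (hva >= mem[i].hva && hva + len <= mem[i].hva + mem[i].len)
                            log_gpa(mem[i].gpa + (hva - mem[i].hva));
    }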

Any IOMMU flushes IOTLB on translation changes. If vhost doesn't then
it's highly likely to be a bug.


> > Parav what's the plan for this? Should be addressed in the
> > spec too.
> >
> 
> AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> 
> >
> >
> > > >
> > > > > 2) There are a lot of special or reserved IOVA ranges (for example the
> > > > > interrupt areas in x86) that need special care which is architectural
> > > > > and where it is beyond the scope or knowledge of the virtio device but
> > > > > the platform IOMMU. Things would be more complicated when SVA is
> > > > > enabled.
> > > >
> > > > SVA being what here?
> > >
> > > For example, IOMMU may treat interrupt ranges differently depending on
> > > whether SVA is enabled or not. It's very hard and unnecessary to teach
> > > devices about this.
> >
> > Oh, shared virtual memory. So what you are saying here? virtio
> > does not care, it just uses some addresses and if you want it to
> > it can record writes somewhere.
> 
> One example: PCI allows devices to send translated requests, so how can a
> hypervisor know whether it's a PA or an IOVA in this case? We probably need
> a new bit. But it's not the only thing we need to deal with.

virtio must always log PA.


> By definition, interrupt ranges and other reserved ranges should not
> belong to dirty pages. And the logging should be done before the DMA,
> where there's no way for the device to know whether or not an IOVA is
> valid. It would be safer to just not report them from the
> source instead of leaving it to the hypervisor to deal with, but this
> seems impossible at the device level. Otherwise the hypervisor driver
> needs to communicate with the (v)IOMMU to learn about the
> interrupt (MSI) area, RMRR area, etc. in order to do the correct thing,
> or it might have security implications. And those areas don't make
> sense at L1 when vSVA is enabled. What's more, when the vIOMMU is
> fully offloaded, there's no easy way to fetch that information.
>
> Again, it's hard to bypass or even duplicate the functionality of the
> platform; otherwise we need to step into every single detail of a specific
> transport, architecture or IOMMU to figure out whether or not logging
> at the virtio level is correct, which is awkward and unrealistic. This
> proposal suffers from a very similar issue when inventing things like
> freeze/stop, where I've pointed out other branches of issues as well.


Exactly, it's a mess.  Instead of making everything 10x more complex,
let's just keep talking about PAs and leave translation to the IOMMU.


> >
> > > >
> > > > > And there could be other architecte specific knowledge (e.g
> > > > > PAGE_SIZE) that might be needed. There's no easy way to deal with
> > > > > those cases.
> > > >
> > > > Good point about page size actually - using 4k unconditionally
> > > > is a waste of resources.
> > >
> > > Actually, they are more than just PAGE_SIZE, for example, PASID and others.
> >
> > what does pasid have to do with it? anyway, just give driver control
> > over page size.
> 
> For example, two virtqueues have two PASIDs assigned. How can a
> hypervisor know which specific IOVA belongs to which PASID? For the
> platform IOMMU, this is easy as it talks to the transport. But I
> don't think we need to duplicate every transport-specific address
> space feature in the core virtio layer:
> 
> 1) translated/untranslated request
> 2) request w/ and w/o PASID
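
To illustrate the ambiguity being pointed out: if the device logs a
bare IOVA, two virtqueues using different PASIDs cannot be told apart,
so a record would hypothetically have to grow transport-specific fields
such as the following (layout invented for illustration only):

    #include <stdint.h>

    struct write_record {                   /* hypothetical record layout */
            uint64_t iova;                  /* address as seen by the device */
            uint32_t pasid;                 /* which I/O address space it belongs to */
            uint32_t flags;                 /* e.g. translated vs. untranslated request */
    };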

Can't say I understand. All the talk about IOVA is just confusing -
what we care about for logging is which page to resend.

> > > >
> > > >
> > > > > We wouldn't need to care about all of them if it is done at platform
> > > > > IOMMU level.
> > > >
> > > > If someone logs at IOMMU level then nothing needs to be done
> > > > in the spec at all. This is about capability at the device level.
> > >
> > > True, but my question is where or not it can be done at the device level easily.
> >
> > there's no "easily" about live migration ever.
> 
> I think I've stated sufficient issues to demonstrate how hard it would
> be for virtio to do it. And I've given the link showing that it is
> possible to do it in the IOMMU without those issues. So in this context
> doing it in virtio is much harder.

Code talks, though.


> > For example on-device iommus are a thing.
> 
> I'm not sure that's the way to go considering the platform IOMMU
> evolves very quickly.

What are you referring to? People buy hardware and use it for years
with no chance to add features.


> >
> > > >
> > > >
> > > > > > what Lingshan
> > > > > > proposed is analogous to bit per page - problem unfortunately is
> > > > > > you can't easily set a bit by DMA.
> > > > > >
> > > > >
> > > > > I'm not saying bit/bytemap is the best, but it has been used by real
> > > > > hardware. And we have many other options.
> > > > >
> > > > > > So I think this dirty tracking is a good option to have.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > >
> > > > > > > > > Why does it matter in 2024?
> > > > > > > > Because users needs to use it now.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > >
> > > > > > > > > > > 4) if the platform support is missing, we can use software or
> > > > > > > > > > > leverage transport for assistance like PRI
> > > > > > > > > > All of these are in theory.
> > > > > > > > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > > > > > > > done by the cpu.
> > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > >
> > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > Do you have perf data for this?
> > > > > > >
> > > > > > > No, but it's not hard to imagine the worst case. Wrote a small program
> > > > > > > that dirty every page by a NIC.
> > > > > > >
> > > > > > > > In the internal tests we don’t see this happening.
> > > > > > >
> > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > >
> > > > > > > So if we get very high dirty rates (e.g by a high speed NIC), we can't
> > > > > > > satisfy the requirement of the downtime. Or if you see the converge,
> > > > > > > you might get help from the auto converge support by the hypervisors
> > > > > > > like KVM where it tries to throttle the VCPU then you can't reach the
> > > > > > > wire speed.
> > > > > >
> > > > > > Will only work for some device types.
> > > > > >
> > > > >
> > > > > Yes, that's the point. Parav said he doesn't see the issue, it's
> > > > > probably because he is testing a virtio-net and so the vCPU is
> > > > > automatically throttled. It doesn't mean it can work for other virito
> > > > > devices.
> > > >
> > > > Only for TX, and I'm pretty sure they had the foresight to test RX not
> > > > just TX but let's confirm. Parav did you test both directions?
> > >
> > > RX speed somehow depends on the speed of refill, so throttling helps
> > > more or less.
> >
> > It doesn't depend on speed of refill you just underrun and drop
> > packets. then your nice 10usec latency becomes more like 10sec.
> 
> I miss your point here. If the driver can't achieve wire speed without
> dirty page tracking, it can't achieve it when dirty page tracking is
> enabled either.

My point is that PRI causes RX ring underruns, and throttling the CPU
makes it worse, not better. And I believe people have actually tried;
NVIDIA has a PRI implementation in hardware. If they come and say
virtio help is needed for performance, I tend to believe them.



> >
> > > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > So it is unusable.
> > > > > > > > >
> > > > > > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > > > > > > slow, PCI can evolve for sure.
> > > > > > > > You should try.
> > > > > > >
> > > > > > > Not my duty, I just want to make sure things are done in the correct
> > > > > > > layer, and once it needs to be done in the virtio, there's nothing
> > > > > > > obviously wrong.
> > > > > >
> > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > >
> > > > > I don't think it's vague, I have explained, if something in the virito
> > > > > slows down the PRI, we can try to fix them.
> > > >
> > > > I don't believe you are going to make PRI fast. No one managed so far.
> > >
> > > So it's the fault of PRI not virito, but it doesn't mean we need to do
> > > it in virtio.
> >
> > I keep saying with this approach we would just say "e1000 emulation is
> > slow and encumbered this is the fault of e1000" and never get virtio at
> > all.  Assigning blame only gets you so far.
> 
> I think we are discussing different things. My point is virtio needs
> to leverage the functionality provided by transport or platform
> (especially considering they evolve faster than virtio). It seems to
> me it's hard even to duplicate some basic function of platform IOMMU
> in virtio.

Dirty tracking in the IOMMU is annoying enough that I am not
sure it's usable. Go ahead but I want to see patches then.

> >
> > > >
> > > > > Missing functions in
> > > > > platform or transport is not a good excuse to try to workaround it in
> > > > > the virtio. It's a layer violation and we never had any feature like
> > > > > this in the past.
> > > >
> > > > Yes missing functionality in the platform is exactly why virtio
> > > > was born in the first place.
> > >
> > > Well the platform can't do device specific logic. But that's not the
> > > case of dirty page tracking which is device logic agnostic.
> >
> > Not true platforms have things like NICs on board and have for many
> > years. It's about performance really.
> 
> I've stated sufficient issues above. And one more obvious issue for
> device-initiated page logging is that it needs a lot of extra or
> unnecessary PCI transactions, which will throttle the performance of
> the whole system (and lead to other issues like QoS).

Maybe. This kind of statement is just vague enough not to be falsifiable.

> So I can't
> believe it has good performance overall. Logging via IOMMU or using
> shadow virtqueue doesn't need any extra PCI transactions at least.

On the other hand they have an extra CPU cost.  Personally, if this is
coming from a hardware vendor, I am inclined to trust them w.r.t. PCI
transactions.  But anyway, discussing this theoretically at a high level
is pointless - whoever bothers with actual prototyping for performance
testing wins; if no one does, I'd expect a back-of-a-napkin estimate
to be included.



> > So I'd like Parav to publish some
> > experiment results and/or some estimates.
> >
> 
> That's fine, but the above equation (used by QEMU) is sufficient to
> demonstrate how hard it is to stick to wire speed in this case.
> 
> >
> > > >
> > > > > >
> > > > > > > > In the current state, it is mandating.
> > > > > > > > And if you think PRI is the only way,
> > > > > > >
> > > > > > > I don't, it's just an example where virtio can leverage from either
> > > > > > > transport or platform. Or if it's the fault in virtio that slows down
> > > > > > > the PRI, then it is something we can do.
> > > > > > >
> > > > > > > >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > >
> > > > > > > No, the point is to not duplicate works especially considering virtio
> > > > > > > can't do better than platform or transport.
> > > > > >
> > > > > > If someone says they tried and platform's migration support does not
> > > > > > work for them and they want to build a solution in virtio then
> > > > > > what exactly is the objection?
> > > > >
> > > > > The discussion is to make sure whether virtio can do this easily and
> > > > > correctly, then we can have a conclusion. I've stated some issues
> > > > > above, and I've asked other questions related to them which are still
> > > > > not answered.
> > > > >
> > > > > I think we had a very hard time in bypassing IOMMU in the past that we
> > > > > don't want to repeat.
> > > > >
> > > > > We've gone through several methods of logging dirty pages in the past
> > > > > (each with pros/cons), but this proposal never explains why it chooses
> > > > > one of them but not others. Spec needs to find the best path instead
> > > > > of just a possible path without any rationale about why.
> > > >
> > > > Adding more rationale isn't a bad thing.
> > > > In particular if platform supplies dirty tracking then how does
> > > > driver decide which to use platform or device capability?
> > > > A bit of discussion around this is a good idea.
> > > >
> > > >
> > > > > > virtio is here in the
> > > > > > first place because emulating devices didn't work well.
> > > > >
> > > > > I don't understand here. We have supported emulated devices for years.
> > > > > I'm pretty sure a lot of issues could be uncovered if this proposal
> > > > > can be prototyped with an emulated device first.
> > > > >
> > > > > Thanks
> > > >
> > > > virtio was originally PV as opposed to emulation. That there's now
> > > > hardware virtio and you call software implementation "an emulation" is
> > > > very meta.
> > >
> > > Yes but I don't see how it relates to dirty page tracking. When we
> > > find a way it should work for both software and hardware devices.
> > >
> > > Thanks
> >
> > It has to work well on a variety of existing platforms. If it does then
> > sure, why would we roll our own.
> 
> If virtio can do that in an efficient way without any issues, I agree.
> But it seems not.
> 
> Thanks



> 
> 
> 
> 
> 
> 
> >
> > --
> > MST
> >



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-13  3:41                                 ` [virtio-comment] " Jason Wang
@ 2023-11-13 14:30                                   ` Michael S. Tsirkin
  2023-11-14  2:03                                     ` Zhu, Lingshan
  2023-11-15 17:37                                   ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-13 14:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Mon, Nov 13, 2023 at 11:41:07AM +0800, Jason Wang wrote:
> On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > Hi Michael,
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 9, 2023 1:29 PM
> >
> > [..]
> > > > Besides the issue of performance, it's also racy, assuming we are logging
> > > IOVA.
> > > >
> > > > 0) device log IOVA
> > > > 1) hypervisor fetches IOVA from log buffer
> > > > 2) guest map IOVA to a new GPA
> > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > >
> > > > Then we lost the old GPA.
> > >
> > > Interesting and a good point. And by the way e.g. vhost has the same issue.  You
> > > need to flush dirty tracking info when changing the mappings somehow.  Parav
> > > what's the plan for this? Should be addressed in the spec too.
> > >
> > As you listed the flush is needed for vhost or device-based DPT.
> 
> What does DPT mean? Device Page Table? Let's not invent terminology
> which is not known by others please.
> 
> We have discussed it many times. You can't just depend on ATS or
> reinventing wheels in virtio.
> 
> What's more, please try not to give me the impression that the
> proposal is optimized for a specific vendor (like device IOMMU
> stuffs).

Devices with an on-device IOMMU exist.
So if it's for a device IOMMU, that's fine, as long as it's well isolated.


> > The necessary plumbing is already covered for this in the query (read and clear) command of this v3 proposal.
> 
> The issue is logging via IOVA ... I don't see how "read and clear" can help.
> 
> > It is listed in Device Write Records Read Command.
> 
> Please explain how your proposal can solve the above race.
> 
> >
> > When the page write record is fully read, it is flushed.
> > How/when to use, I think its hypervisor specific, so we probably better off not documenting those details.
> 
> Well, as the author of this proposal, at least you need to know how a
> hypervisor can work with your proposal, no?
> 
> > May be such read is needed in some other path too depending on how hypervisor implemented.
> 
> What do you mean by "May be ... some other path" here? You're
> inventing a mechanism that you don't know how a hypervisor can use?
> 
> Thanks


It seems like a subtle enough race that it really should be documented.
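
One possible ordering that would avoid losing the old GPA, sketched
below with hypothetical helper names; the spec text would need to
define the actual commands and their ordering guarantees:

    #include <stdint.h>
    #include <stddef.h>

    struct iova_map;                                /* opaque IOVA->GPA view */
    struct dirty_record { uint64_t iova; uint32_t len; };

    /* stand-ins for the real hypervisor/device interfaces */
    extern size_t   dev_read_and_clear_records(struct dirty_record *rec, size_t max);
    extern uint64_t iova_to_gpa(const struct iova_map *map, uint64_t iova);
    extern void     mark_gpa_dirty(uint64_t gpa, uint32_t len);
    extern void     install_mapping(struct iova_map *new_map);

    void handle_guest_remap(const struct iova_map *old_map, struct iova_map *new_map)
    {
            struct dirty_record rec[256];
            size_t n;

            /* Drain the device's write records before the old translation goes
             * away, so every logged IOVA can still be resolved to the GPA it
             * referred to at the time of the write.
             */
            while ((n = dev_read_and_clear_records(rec, 256)) != 0)
                    for (size_t i = 0; i < n; i++)
                            mark_gpa_dirty(iova_to_gpa(old_map, rec[i].iova), rec[i].len);

            install_mapping(new_map);
    }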

-- 
MST



* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-13 14:30                                   ` Michael S. Tsirkin
@ 2023-11-14  2:03                                     ` Zhu, Lingshan
  2023-11-14  7:52                                       ` Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-14  2:03 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/13/2023 10:30 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 13, 2023 at 11:41:07AM +0800, Jason Wang wrote:
>> On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com> wrote:
>>> Hi Michael,
>>>
>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>> Sent: Thursday, November 9, 2023 1:29 PM
>>> [..]
>>>>> Besides the issue of performance, it's also racy, assuming we are logging
>>>> IOVA.
>>>>> 0) device log IOVA
>>>>> 1) hypervisor fetches IOVA from log buffer
>>>>> 2) guest map IOVA to a new GPA
>>>>> 3) hypervisor traverse guest table to get IOVA to new GPA
>>>>>
>>>>> Then we lost the old GPA.
>>>> Interesting and a good point. And by the way e.g. vhost has the same issue.  You
>>>> need to flush dirty tracking info when changing the mappings somehow.  Parav
>>>> what's the plan for this? Should be addressed in the spec too.
>>>>
>>> As you listed the flush is needed for vhost or device-based DPT.
>> What does DPT mean? Device Page Table? Let's not invent terminology
>> which is not known by others please.
>>
>> We have discussed it many times. You can't just depend on ATS or
>> reinventing wheels in virtio.
>>
>> What's more, please try not to give me the impression that the
>> proposal is optimized for a specific vendor (like device IOMMU
>> stuffs).
> Devices with IOMMU exist.
> So if it's for device IOMMU that's fine, as long as it's well isolated.
A device-side IOMMU is not a must, and I hope virtio features don't
depend on it.
>
>
>>> The necessary plumbing is already covered for this in the query (read and clear) command of this v3 proposal.
>> The issue is logging via IOVA ... I don't see how "read and clear" can help.
>>
>>> It is listed in Device Write Records Read Command.
>> Please explain how your proposal can solve the above race.
>>
>>> When the page write record is fully read, it is flushed.
>>> How/when to use, I think its hypervisor specific, so we probably better off not documenting those details.
>> Well, as the author of this proposal, at least you need to know how a
>> hypervisor can work with your proposal, no?
>>
>>> May be such read is needed in some other path too depending on how hypervisor implemented.
>> What do you mean by "May be ... some other path" here? You're
>> inventing a mechanism that you don't know how a hypervisor can use?
>>
>> Thanks
>
> It seems like a subtle enough race that it really should be documented.
>



* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-13  6:57                                 ` Michael S. Tsirkin
@ 2023-11-14  7:34                                   ` Zhu, Lingshan
  2023-11-14  7:59                                     ` Jason Wang
  2023-11-14  8:27                                     ` Michael S. Tsirkin
  2023-11-14  7:57                                   ` Jason Wang
  1 sibling, 2 replies; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-14  7:34 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/13/2023 2:57 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 13, 2023 at 11:31:37AM +0800, Jason Wang wrote:
>> On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
>>>> On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
>>>>>> On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>> On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
>>>>>>>>>>> Each virtio and non virtio devices who wants to report their dirty page report,
>>>>>>>>>> will do their way.
>>>>>>>>>>>> 3) inventing it in the virtio layer will be deprecated in the future
>>>>>>>>>>>> for sure, as platform will provide much rich features for logging
>>>>>>>>>>>> e.g it can do it per PASID etc, I don't see any reason virtio need
>>>>>>>>>>>> to compete with the features that will be provided by the platform
>>>>>>>>>>> Can you bring the cpu vendors and committement to virtio tc with timelines
>>>>>>>>>> so that virtio TC can omit?
>>>>>>>>>>
>>>>>>>>>> Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
>>>>>>>>>> on top of transport or platform. There's no need to duplicate their job.
>>>>>>>>>> Especially considering that virtio can't do better than them.
>>>>>>>>>>
>>>>>>>>> I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
>>>>>>>> The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
>>>>>>>> ARM are all supporting that now.
>>>>>>>>
>>>>>>>>> And the work seems to have started for some platforms.
>>>>>>>> Let me quote from the above link:
>>>>>>>>
>>>>>>>> """
>>>>>>>> Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
>>>>>>>> alongside VT-D rev3.x also do support.
>>>>>>>> """
>>>>>>>>
>>>>>>>>> Without such platform commitment, virtio also skipping it would not work.
>>>>>>>> Is the above sufficient? I'm a little bit more familiar with vtd, the
>>>>>>>> hw feature has been there for years.
>>>>>>>
>>>>>>> Repeating myself - I'm not sure that will work well for all workloads.
>>>>>> I think this comment applies to this proposal as well.
>>>>> Yes - some systems might be better off with platform tracking.
>>>>> And I think supporting shadow vq better would be nice too.
>>>> For shadow vq, did you mean the work that is done by Eugenio?
>>> Yes.
>> That's exactly why vDPA starts with shadow virtqueue. We've evaluated
>> various possible approaches, each of them have their shortcomings and
>> shadow virtqueue is the only one that doesn't require any additional
>> hardware features to work in every platform.
> What I would like to see is effort to switch shadow on/off not keep it
> on at all times. That's only good enough for a PoC. And to work on top
> of virtio that will require effort in the spec.  If I see spec patches
> that do that I personally would support that.  It needs to be reasonably
> generic though, a single 16 bit RW number is not going to be enough. I
> think it's likely admin commands is a good interface for this. If it's a
> hack making vendor specific assumptions, just keep it in vdpa.
>
>>>>>>> Definitely KVM did
>>>>>>> not scan PTEs. It used pagefaults with bit per page and later as VM size
>>>>>>> grew switched to PLM.  This interface is analogous to PLM,
>>>>>> I think you meant PML actually. And it doesn't work like PML. To
>>>>>> behave like PML it needs to
>>>>>>
>>>>>> 1) log buffers were organized as a queue with indices
>>>>>> 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
>>>>>> 3) device need to send a notification to the driver if it runs out of the buffer
>>>>>>
>>>>>> I don't see any of the above in this proposal. If we do that it would
>>>>>> be less problematic than what is being proposed here.
>>>>> What is common between this and PML is that you get the addresses
>>>>> directly without scanning megabytes of bitmaps or worse -
>>>>> hundreds of megabytes of page tables.
>>>> Yes, it has overhead but this is the method we use for vhost and KVM (earlier).
>>>>
>>>> To me the  important advantage of PML is that it uses limited
>>>> resources on the host which
>>>>
>>>> 1) doesn't require resources in the device
>>>> 2) doesn't scale as the guest memory increases. (but this advantage
>>>> doesn't exist in neither this nor bitmap)
>>> it seems 2 exactly exists here.
>> Actually not, Parav said the device needs to reserve sufficient
>> resources in another thread.
>>
>>>
>>>>> The data structure is different but I don't see why it is critical.
>>>>>
>>>>> I agree that I don't see out of buffers notifications too which implies
>>>>> device has to maintain something like a bitmap internally.  Which I
>>>>> guess could be fine but it is not clear to me how large that bitmap has
>>>>> to be. How does the device know? Needs to be addressed.
>>>> This is the question I asked Parav in another thread. Using host
>>>> memory as a queue with notification (like PML) might be much better.
>>> Well if queue is what you want to do you can just do it internally.
>> Then it's not the proposal here, Parav has explained it in another
>> reply, and as explained it lacks a lot of other facilities.
>>
>>> Problem of course is that it might overflow and cause things like
>>> packet drops.
>> Exactly like PML. So sticking to wire speed should not be a general
>> goal in the context of migration. It can be done if the speed of the
>> migration interface is faster than the virtio device that needs to be
>> migrated.
> People buy hardware to improve performance. Apparently there are people
> who want to build this hardware. It is not our role to tell either
> of the groups "this should not be a general goal".
>
>
>>>
>>>>>
>>>>>> Even if we manage to do that, it doesn't mean we won't have issues.
>>>>>>
>>>>>> 1) For many reasons it can neither see nor log via GPA, so this
>>>>>> requires a traversal of the vIOMMU mapping tables by the hypervisor
>>>>>> afterwards, it would be expensive and need synchronization with the
>>>>>> guest modification of the IO page table which looks very hard.
>>>>> vIOMMU is fast enough to be used on data path but not fast enough for
>>>>> dirty tracking?
>>>> We set up SPTEs or using nesting offloading where the PTEs could be
>>>> iterated by hardware directly which is fast.
>>> There's a way to have hardware find dirty PTEs for you quickly?
>> Scanning PTEs on the host is faster and more secure than scanning
>> guests, that's what I want to say:
>>
>> 1) the guest page could be swapped out but not the host one.
>> 2) no guest triggerable behavior
>>
>>> I don't know how it's done. Do tell.
>>>
>>>
>>>> This is not the case here where software needs to iterate the IO page
>>>> tables in the guest which could be slow.
>>>>
>>>>> Hard to believe.  If true and you want to speed up
>>>>> vIOMMU then you implement an efficient datastructure for that.
>>>> Besides the issue of performance, it's also racy, assuming we are logging IOVA.
>>>>
>>>> 0) device log IOVA
>>>> 1) hypervisor fetches IOVA from log buffer
>>>> 2) guest map IOVA to a new GPA
>>>> 3) hypervisor traverse guest table to get IOVA to new GPA
>>>>
>>>> Then we lost the old GPA.
>>> Interesting and a good point.
>> Note that PML logs at GPA as it works at L1 of EPT.
> And that's perfect for migration.
>
>>> And by the way e.g. vhost has the same
>>> issue.  You need to flush dirty tracking info when changing the mappings
>>> somehow.
>> It's not,
>>
>> 1) memory translation is done by vhost
>> 2) vhost knows GPA and it doesn't log via IOVA.
>>
>> See this for example, and DPDK has similar fixes.
>>
>> commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
>> Author: Jason Wang <jasowang@redhat.com>
>> Date:   Wed Jan 16 16:54:42 2019 +0800
>>
>>      vhost: log dirty page correctly
>>
>>      Vhost dirty page logging API is designed to sync through GPA. But we
>>      try to log GIOVA when device IOTLB is enabled. This is wrong and may
>>      lead to missing data after migration.
>>
>>      To solve this issue, when logging with device IOTLB enabled, we will:
>>
>>      1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
>>         get HVA, for writable descriptor, get HVA through iovec. For used
>>         ring update, translate its GIOVA to HVA
>>      2) traverse the GPA->HVA mapping to get the possible GPA and log
>>         through GPA. Pay attention this reverse mapping is not guaranteed
>>         to be unique, so we should log each possible GPA in this case.
>>
>>      This fix the failure of scp to guest during migration. In -next, we
>>      will probably support passing GIOVA->GPA instead of GIOVA->HVA.
>>
>>      Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
>>      Reported-by: Jintack Lim <jintack@cs.columbia.edu>
>>      Cc: Jintack Lim <jintack@cs.columbia.edu>
>>      Signed-off-by: Jason Wang <jasowang@redhat.com>
>>      Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>      Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>> All of the above is not what virtio did right now.
> Any IOMMU flushes IOTLB on translation changes. If vhost doesn't then
> it's highly likely to be a bug.
>
>
>>> Parav what's the plan for this? Should be addressed in the
>>> spec too.
>>>
>> AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
>>
>>>
>>>>>> 2) There are a lot of special or reserved IOVA ranges (for example the
>>>>>> interrupt areas in x86) that need special care which is architectural
>>>>>> and where it is beyond the scope or knowledge of the virtio device but
>>>>>> the platform IOMMU. Things would be more complicated when SVA is
>>>>>> enabled.
>>>>> SVA being what here?
>>>> For example, IOMMU may treat interrupt ranges differently depending on
>>>> whether SVA is enabled or not. It's very hard and unnecessary to teach
>>>> devices about this.
>>> Oh, shared virtual memory. So what you are saying here? virtio
>>> does not care, it just uses some addresses and if you want it to
>>> it can record writes somewhere.
>> One example, PCI allows devices to send translated requests, how can a
>> hypervisor know it's a PA or IOVA in this case? We probably need a new
>> bit. But it's not the only thing we need to deal with.
> virtio must always log PA.
>
>
>> By definition, interrupt ranges and other reserved ranges should not
>> belong to dirty pages. And the logging should be done before the DMA
>> where there's no way for the device to know whether or not an IOVA is
>> valid or not. It would be more safe to just not report them from the
>> source instead of leaving it to the hypervisor to deal with but this
>> seems impossible at the device level. Otherwise the hypervisor driver
>> needs to communicate with the (v)IOMMU to be reached with the
>> interrupt(MSI) area, RMRR area etc in order to do the correct things
>> or it might have security implications. And those areas don't make
>> sense at L1 when vSVA is enabled. What's more, when vIOMMU could be
>> fully offloaded, there's no easy way to fetch that information.
>>
>> Again, it's hard to bypass or even duplicate the functionality of the
>> platform or we need to step into every single detail of a specific
>> transport, architecture or IOMMU to figure out whether or not logging
>> at virtio is correct which is awkward and unrealistic. This proposal
>> suffers from an exact similar issue when inventing things like
>> freeze/stop where I've pointed out other branches of issues as well.
>
> Exactly it's a mess.  Instead of making everything 10x more complex,
> let's just keep talking about PA and leave translation to IOMMU.
>
>
>>>>>> And there could be other architecte specific knowledge (e.g
>>>>>> PAGE_SIZE) that might be needed. There's no easy way to deal with
>>>>>> those cases.
>>>>> Good point about page size actually - using 4k unconditionally
>>>>> is a waste of resources.
>>>> Actually, they are more than just PAGE_SIZE, for example, PASID and others.
>>> what does pasid have to do with it? anyway, just give driver control
>>> over page size.
>> For example, two virtqueues have two PASIDs assigned. How can a
>> hypervisor know which specific IOVA belongs to which IOVA? For
>> platform IOMMU, they are handy as it talks to the transport. But I
>> don't think we need to duplicate every transport specific address
>> space feature in core virtio layer:
>>
>> 1) translated/untranslated request
>> 2) request w/ and w/o PASID
> Can't say I understand. All the talk about IOVA is just confusing -
> what we care about for logging is which page to resend.
>
>>>>>
>>>>>> We wouldn't need to care about all of them if it is done at platform
>>>>>> IOMMU level.
>>>>> If someone logs at IOMMU level then nothing needs to be done
>>>>> in the spec at all. This is about capability at the device level.
>>>> True, but my question is where or not it can be done at the device level easily.
>>> there's no "easily" about live migration ever.
>> I think I've stated sufficient issues to demonstrate how hard virtio
>> wants to do it. And I've given the link that it is possible to do that
>> in IOMMU without those issues. So in this context doing it in virtio
>> is much harder.
> Code walks though.
>
>
>>> For example on-device iommus are a thing.
>> I'm not sure that's the way to go considering the platform IOMMU
>> evolves very quickly.
> What do you refer to? People buy hardware and use it for years
> with no chance to add features.
>
>
>>>>>
>>>>>>> what Lingshan
>>>>>>> proposed is analogous to bit per page - problem unfortunately is
>>>>>>> you can't easily set a bit by DMA.
>>>>>>>
>>>>>> I'm not saying bit/bytemap is the best, but it has been used by real
>>>>>> hardware. And we have many other options.
>>>>>>
>>>>>>> So I think this dirty tracking is a good option to have.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>>> i.e. in first year of 2024?
>>>>>>>>>> Why does it matter in 2024?
>>>>>>>>> Because users needs to use it now.
>>>>>>>>>
>>>>>>>>>>> If not, we are better off to offer this, and when/if platform support is, sure,
>>>>>>>>>> this feature can be disabled/not used/not enabled.
>>>>>>>>>>>> 4) if the platform support is missing, we can use software or
>>>>>>>>>>>> leverage transport for assistance like PRI
>>>>>>>>>>> All of these are in theory.
>>>>>>>>>>> Our experiment shows PRI performance is 21x slower than page fault rate
>>>>>>>>>> done by the cpu.
>>>>>>>>>>> It simply does not even pass a simple 10Gbps test.
>>>>>>>>>> If you stick to the wire speed during migration, it can converge.
>>>>>>>>> Do you have perf data for this?
>>>>>>>> No, but it's not hard to imagine the worst case. Wrote a small program
>>>>>>>> that dirty every page by a NIC.
>>>>>>>>
>>>>>>>>> In the internal tests we don’t see this happening.
>>>>>>>> downtime = dirty_rates * PAGE_SIZE / migration_speed
>>>>>>>>
>>>>>>>> So if we get very high dirty rates (e.g by a high speed NIC), we can't
>>>>>>>> satisfy the requirement of the downtime. Or if you see the converge,
>>>>>>>> you might get help from the auto converge support by the hypervisors
>>>>>>>> like KVM where it tries to throttle the VCPU then you can't reach the
>>>>>>>> wire speed.
>>>>>>> Will only work for some device types.
>>>>>>>
>>>>>> Yes, that's the point. Parav said he doesn't see the issue, it's
>>>>>> probably because he is testing a virtio-net and so the vCPU is
>>>>>> automatically throttled. It doesn't mean it can work for other virito
>>>>>> devices.
>>>>> Only for TX, and I'm pretty sure they had the foresight to test RX not
>>>>> just TX but let's confirm. Parav did you test both directions?
>>>> RX speed somehow depends on the speed of refill, so throttling helps
>>>> more or less.
>>> It doesn't depend on speed of refill you just underrun and drop
>>> packets. then your nice 10usec latency becomes more like 10sec.
>> I miss your point here. If the driver can't achieve wire speed without
>> dirty page tracking, it can neither when dirty page tracking is
>> enabled.
> My point is PRI causes rx ring underruns and throttling CPU makes it
> worse not better. And I believe people actually tried, nvidia
> have a pri implementation in hardware. If they come and say
> virtio help is needed for performance I tend to believe them.
>
>
>
>>>>>>>
>>>>>>>>>>> There is no requirement for mandating PRI either.
>>>>>>>>>>> So it is unusable.
>>>>>>>>>> It's not about mandating, it's about doing things in the correct layer. If PRI is
>>>>>>>>>> slow, PCI can evolve for sure.
>>>>>>>>> You should try.
>>>>>>>> Not my duty, I just want to make sure things are done in the correct
>>>>>>>> layer, and once it needs to be done in the virtio, there's nothing
>>>>>>>> obviously wrong.
>>>>>>> Yea but just vague questions don't help to make sure eiter way.
>>>>>> I don't think it's vague, I have explained, if something in the virito
>>>>>> slows down the PRI, we can try to fix them.
>>>>> I don't believe you are going to make PRI fast. No one managed so far.
>>>> So it's the fault of PRI not virito, but it doesn't mean we need to do
>>>> it in virtio.
>>> I keep saying with this approach we would just say "e1000 emulation is
>>> slow and encumbered this is the fault of e1000" and never get virtio at
>>> all.  Assigning blame only gets you so far.
>> I think we are discussing different things. My point is virtio needs
>> to leverage the functionality provided by transport or platform
>> (especially considering they evolve faster than virtio). It seems to
>> me it's hard even to duplicate some basic function of platform IOMMU
>> in virtio.
> Dirty tracking in the IOMMU is annoying enough that I am not
> sure it's usable. Go ahead but I want to see patches then.
>
>>>>>> Missing functions in
>>>>>> platform or transport is not a good excuse to try to workaround it in
>>>>>> the virtio. It's a layer violation and we never had any feature like
>>>>>> this in the past.
>>>>> Yes missing functionality in the platform is exactly why virtio
>>>>> was born in the first place.
>>>> Well the platform can't do device specific logic. But that's not the
>>>> case of dirty page tracking which is device logic agnostic.
>>> Not true platforms have things like NICs on board and have for many
>>> years. It's about performance really.
>> I've stated sufficient issues above. And one more obvious issue for
>> device initiated page logging is that it needs a lot of extra or
>> unnecessary PCI transactions which will throttle the performance of
>> the whole system (and lead to other issues like QOS).
> Maybe. This kind of statement is just vague enough not to be falsifiable.
>
>> So I can't
>> believe it has good performance overall. Logging via IOMMU or using
>> shadow virtqueue doesn't need any extra PCI transactions at least.
> On the other hand they have an extra CPU cost.  Personally if this is
> coming from a hardware vendor, I am inclined to trust them wrt PCI
> transactions.  But anyway, discussing this at a high level theoretically
> is pointless - whoever bothers with actual prototyping for performance
> testing wins, if no one does I'd expect a back of a napkin estimate
> to be included.
If so, Intel has released products implementing these interfaces years ago;
see live migration in 4.1. IFCVF vDPA Implementation,
https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
and

But I still believe we are here to try our best to work out an industrial
spec of better quality that serves a broad interest. This is not a
competition between companies, and the spec is not a FIFO; it is not as
if the early bird catches all the worms.
>
>
>
>>> So I'd like Parav to publish some
>>> experiment results and/or some estimates.
>>>
>> That's fine, but the above equation (used by Qemu) is sufficient to
>> demonstrate how hard to stick wire speed in the case.
>>
>>>>>>>>> In the current state, it is mandating.
>>>>>>>>> And if you think PRI is the only way,
>>>>>>>> I don't, it's just an example where virtio can leverage from either
>>>>>>>> transport or platform. Or if it's the fault in virtio that slows down
>>>>>>>> the PRI, then it is something we can do.
>>>>>>>>
>>>>>>>>>   than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
>>>>>>>> No, the point is to not duplicate works especially considering virtio
>>>>>>>> can't do better than platform or transport.
>>>>>>> If someone says they tried and platform's migration support does not
>>>>>>> work for them and they want to build a solution in virtio then
>>>>>>> what exactly is the objection?
>>>>>> The discussion is to make sure whether virtio can do this easily and
>>>>>> correctly, then we can have a conclusion. I've stated some issues
>>>>>> above, and I've asked other questions related to them which are still
>>>>>> not answered.
>>>>>>
>>>>>> I think we had a very hard time in bypassing IOMMU in the past that we
>>>>>> don't want to repeat.
>>>>>>
>>>>>> We've gone through several methods of logging dirty pages in the past
>>>>>> (each with pros/cons), but this proposal never explains why it chooses
>>>>>> one of them but not others. Spec needs to find the best path instead
>>>>>> of just a possible path without any rationale about why.
>>>>> Adding more rationale isn't a bad thing.
>>>>> In particular if platform supplies dirty tracking then how does
>>>>> driver decide which to use platform or device capability?
>>>>> A bit of discussion around this is a good idea.
>>>>>
>>>>>
>>>>>>> virtio is here in the
>>>>>>> first place because emulating devices didn't work well.
>>>>>> I don't understand here. We have supported emulated devices for years.
>>>>>> I'm pretty sure a lot of issues could be uncovered if this proposal
>>>>>> can be prototyped with an emulated device first.
>>>>>>
>>>>>> Thanks
>>>>> virtio was originally PV as opposed to emulation. That there's now
>>>>> hardware virtio and you call software implementation "an emulation" is
>>>>> very meta.
>>>> Yes but I don't see how it relates to dirty page tracking. When we
>>>> find a way it should work for both software and hardware devices.
>>>>
>>>> Thanks
>>> It has to work well on a variety of existing platforms. If it does then
>>> sure, why would we roll our own.
>> If virtio can do that in an efficient way without any issues, I agree.
>> But it seems not.
>>
>> Thanks
>
>
>>
>>
>>
>>
>>
>>> --
>>> MST
>>>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-14  2:03                                     ` Zhu, Lingshan
@ 2023-11-14  7:52                                       ` Jason Wang
  0 siblings, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-14  7:52 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Michael S. Tsirkin, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Nov 14, 2023 at 10:04 AM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>
>
>
> On 11/13/2023 10:30 PM, Michael S. Tsirkin wrote:
> > On Mon, Nov 13, 2023 at 11:41:07AM +0800, Jason Wang wrote:
> >> On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com> wrote:
> >>> Hi Michael,
> >>>
> >>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>> Sent: Thursday, November 9, 2023 1:29 PM
> >>> [..]
> >>>>> Besides the issue of performance, it's also racy, assuming we are logging
> >>>> IOVA.
> >>>>> 0) device log IOVA
> >>>>> 1) hypervisor fetches IOVA from log buffer
> >>>>> 2) guest map IOVA to a new GPA
> >>>>> 3) hypervisor traverse guest table to get IOVA to new GPA
> >>>>>
> >>>>> Then we lost the old GPA.
> >>>> Interesting and a good point. And by the way e.g. vhost has the same issue.  You
> >>>> need to flush dirty tracking info when changing the mappings somehow.  Parav
> >>>> what's the plan for this? Should be addressed in the spec too.
> >>>>
> >>> As you listed the flush is needed for vhost or device-based DPT.
> >> What does DPT mean? Device Page Table? Let's not invent terminology
> >> which is not known by others please.
> >>
> >> We have discussed it many times. You can't just depend on ATS or
> >> reinventing wheels in virtio.
> >>
> >> What's more, please try not to give me the impression that the
> >> proposal is optimized for a specific vendor (like device IOMMU
> >> stuffs).
> > Devices with IOMMU exist.

It only exists for very few vendors. Whether or not virtio needs a
device IOMMU is a separate topic.

Again, it's far easier to leverage what the platform can give us than to
re-invent stuff in virtio.

> > So if it's for device IOMMU that's fine, as long as it's well isolated.
> device side IOMMU is not a must and I hope virito features don't depend
> on it.

Exactly, and I didn't see any description in this series that:

1) claims the feature depends on a device IOMMU
2) invents interfaces for a device IOMMU

Thanks



> >
> >
> >>> The necessary plumbing is already covered for this in the query (read and clear) command of this v3 proposal.
> >> The issue is logging via IOVA ... I don't see how "read and clear" can help.
> >>
> >>> It is listed in Device Write Records Read Command.
> >> Please explain how your proposal can solve the above race.
> >>
> >>> When the page write record is fully read, it is flushed.
> >>> How/when to use, I think its hypervisor specific, so we probably better off not documenting those details.
> >> Well, as the author of this proposal, at least you need to know how a
> >> hypervisor can work with your proposal, no?
> >>
> >>> May be such read is needed in some other path too depending on how hypervisor implemented.
> >> What do you mean by "May be ... some other path" here? You're
> >> inventing a mechanism that you don't know how a hypervisor can use?
> >>
> >> Thanks
> >
> > It seems like a subtle enough race that it really should be documented.
> >
>





* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-13  6:57                                 ` Michael S. Tsirkin
  2023-11-14  7:34                                   ` Zhu, Lingshan
@ 2023-11-14  7:57                                   ` Jason Wang
  2023-11-14  9:16                                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-14  7:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Mon, Nov 13, 2023 at 2:57 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Nov 13, 2023 at 11:31:37AM +0800, Jason Wang wrote:
> > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > > > > > > > > will do their way.
> > > > > > > > > > >
> > > > > > > > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > > > > > > > for sure, as platform will provide much rich features for logging
> > > > > > > > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > > to compete with the features that will be provided by the platform
> > > > > > > > > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > >
> > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > > > > > > > > on top of transport or platform. There's no need to duplicate their job.
> > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > >
> > > > > > > > > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> > > > > > > >
> > > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > > > > > > ARM are all supporting that now.
> > > > > > > >
> > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > >
> > > > > > > > Let me quote from the above link:
> > > > > > > >
> > > > > > > > """
> > > > > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > > > > alongside VT-D rev3.x also do support.
> > > > > > > > """
> > > > > > > >
> > > > > > > > > Without such platform commitment, virtio also skipping it would not work.
> > > > > > > >
> > > > > > > > Is the above sufficient? I'm a little bit more familiar with vtd, the
> > > > > > > > hw feature has been there for years.
> > > > > > >
> > > > > > >
> > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > >
> > > > > > I think this comment applies to this proposal as well.
> > > > >
> > > > > Yes - some systems might be better off with platform tracking.
> > > > > And I think supporting shadow vq better would be nice too.
> > > >
> > > > For shadow vq, did you mean the work that is done by Eugenio?
> > >
> > > Yes.
> >
> > That's exactly why vDPA starts with shadow virtqueue. We've evaluated
> > various possible approaches, each of them have their shortcomings and
> > shadow virtqueue is the only one that doesn't require any additional
> > hardware features to work in every platform.
>
> What I would like to see is effort to switch shadow on/off not keep it
> on at all times. That's only good enough for a PoC. And to work on top
> of virtio that will require effort in the spec.

Well, there are various approaches. If we just care about switching the
shadow vq on/off, the virtqueue indexes plus the inflight descriptors
should be sufficient.
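
As a rough illustration (my own sketch, not something from this series or
the spec), the per-virtqueue snapshot a hypervisor would move when switching
a shadow vq on or off could look roughly like this; the structure and field
names are hypothetical:

#include <stdint.h>

#define VQ_SIZE 256   /* example queue size, chosen arbitrarily for the sketch */

/* Hypothetical snapshot of one suspended virtqueue. Not part of the
 * virtio spec or of any existing driver. */
struct vq_snapshot {
        uint16_t avail_idx;              /* next index the driver will make available */
        uint16_t used_idx;               /* next index the device will mark used */
        uint8_t  inflight[VQ_SIZE / 8];  /* descriptors made available but not yet
                                          * used; needed for out-of-order devices,
                                          * implied by the indexes for in-order ones */
};

For an in-order device the bitmap is implied by the two indexes, which is
essentially why a pair of 16-bit values has been enough for vDPA networking
devices so far.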

Talking about the future, since vDPA allows conditionally trapping a
virtqueue via ASID, I expect virtio can do the same if PASID is
supported (and there used to be a proposal for this in the past).

>  If I see spec patches
> that do that I personally would support that.  It needs to be reasonably
> generic though, a single 16 bit RW number is not going to be enough.

It's really device specific; vDPA has demonstrated that it's
sufficient for networking devices.

> I
> think it's likely admin commands is a good interface for this. If it's a
> hack making vendor specific assumptions, just keep it in vdpa.

This part I don't understand. Most of the virtqueue state is accessed
via common_cfg; I don't see the advantage of scattering the rest across
other places unless there's a new transport.

>
> > >
> > > > >
> > > > > > > Definitely KVM did
> > > > > > > not scan PTEs. It used pagefaults with bit per page and later as VM size
> > > > > > > grew switched to PLM.  This interface is analogous to PLM,
> > > > > >
> > > > > > I think you meant PML actually. And it doesn't work like PML. To
> > > > > > behave like PML it needs to
> > > > > >
> > > > > > 1) log buffers were organized as a queue with indices
> > > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> > > > > > 3) device need to send a notification to the driver if it runs out of the buffer
> > > > > >
> > > > > > I don't see any of the above in this proposal. If we do that it would
> > > > > > be less problematic than what is being proposed here.
> > > > >
> > > > > What is common between this and PML is that you get the addresses
> > > > > directly without scanning megabytes of bitmaps or worse -
> > > > > hundreds of megabytes of page tables.
> > > >
> > > > Yes, it has overhead but this is the method we use for vhost and KVM (earlier).
> > > >
> > > > To me the  important advantage of PML is that it uses limited
> > > > resources on the host which
> > > >
> > > > 1) doesn't require resources in the device
> > > > 2) doesn't scale as the guest memory increases. (but this advantage
> > > > doesn't exist in neither this nor bitmap)
> > >
> > > it seems 2 exactly exists here.
> >
> > Actually not, Parav said the device needs to reserve sufficient
> > resources in another thread.
> >
> > >
> > >
> > > > >
> > > > > The data structure is different but I don't see why it is critical.
> > > > >
> > > > > I agree that I don't see out of buffers notifications too which implies
> > > > > device has to maintain something like a bitmap internally.  Which I
> > > > > guess could be fine but it is not clear to me how large that bitmap has
> > > > > to be. How does the device know? Needs to be addressed.
> > > >
> > > > This is the question I asked Parav in another thread. Using host
> > > > memory as a queue with notification (like PML) might be much better.
> > >
> > > Well if queue is what you want to do you can just do it internally.
> >
> > Then it's not the proposal here, Parav has explained it in another
> > reply, and as explained it lacks a lot of other facilities.
> >
> > > Problem of course is that it might overflow and cause things like
> > > packet drops.
> >
> > Exactly like PML. So sticking to wire speed should not be a general
> > goal in the context of migration. It can be done if the speed of the
> > migration interface is faster than the virtio device that needs to be
> > migrated.
>
> People buy hardware to improve performance. Apparently there are people
> who want to build this hardware.

We are talking about different things. What I'm saying is that
sticking to wire speed somehow conflicts with the downtime goal. If
the management/guest doesn't allow increasing the downtime, it's very
hard to stay at wire speed while dirty page tracking is live. This
doesn't prevent people from building and using faster hardware; the
hardware might just run slower during live migration. If I am wrong,
please explain why.
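
For illustration only, plugging assumed numbers into that reasoning (4 KiB
pages, a device dirtying memory at a 100 Gbps wire speed, a 25 Gbps
migration stream; none of these are measured figures):

    dirty rate:   100 Gbps ~ 12.5 GB/s ~ 3,000,000 pages/s
    resend rate:   25 Gbps ~  3.1 GB/s ~   760,000 pages/s

The dirty rate exceeds what the migration stream can resend, so pre-copy
cannot converge unless the device (or the guest) is throttled below the
migration bandwidth.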

> It is not our role to tell either
> of the groups "this should not be a general goal".

Well, the downtime has been well studied and used for years, and I have
described the assumption:

"
It can be done if the speed of the migration interface is faster than
the virtio device that needs to be migrated.
"

KVM and Qemu have a lot of mechanisms to throttle as well.

>
>
> > >
> > >
> > > > >
> > > > >
> > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > >
> > > > > > 1) For many reasons it can neither see nor log via GPA, so this
> > > > > > requires a traversal of the vIOMMU mapping tables by the hypervisor
> > > > > > afterwards, it would be expensive and need synchronization with the
> > > > > > guest modification of the IO page table which looks very hard.
> > > > >
> > > > > vIOMMU is fast enough to be used on data path but not fast enough for
> > > > > dirty tracking?
> > > >
> > > > We set up SPTEs or using nesting offloading where the PTEs could be
> > > > iterated by hardware directly which is fast.
> > >
> > > There's a way to have hardware find dirty PTEs for you quickly?
> >
> > Scanning PTEs on the host is faster and more secure than scanning
> > guests, that's what I want to say:
> >
> > 1) the guest page could be swapped out but not the host one.
> > 2) no guest triggerable behavior
> >
> > > I don't know how it's done. Do tell.
> > >
> > >
> > > > This is not the case here where software needs to iterate the IO page
> > > > tables in the guest which could be slow.
> > > >
> > > > > Hard to believe.  If true and you want to speed up
> > > > > vIOMMU then you implement an efficient datastructure for that.
> > > >
> > > > Besides the issue of performance, it's also racy, assuming we are logging IOVA.
> > > >
> > > > 0) device log IOVA
> > > > 1) hypervisor fetches IOVA from log buffer
> > > > 2) guest map IOVA to a new GPA
> > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > >
> > > > Then we lost the old GPA.
> > >
> > > Interesting and a good point.
> >
> > Note that PML logs at GPA as it works at L1 of EPT.
>
> And that's perfect for migration.

Right.

>
> > > And by the way e.g. vhost has the same
> > > issue.  You need to flush dirty tracking info when changing the mappings
> > > somehow.
> >
> > It's not,
> >
> > 1) memory translation is done by vhost
> > 2) vhost knows GPA and it doesn't log via IOVA.
> >
> > See this for example, and DPDK has similar fixes.
> >
> > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > Author: Jason Wang <jasowang@redhat.com>
> > Date:   Wed Jan 16 16:54:42 2019 +0800
> >
> >     vhost: log dirty page correctly
> >
> >     Vhost dirty page logging API is designed to sync through GPA. But we
> >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> >     lead to missing data after migration.
> >
> >     To solve this issue, when logging with device IOTLB enabled, we will:
> >
> >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> >        get HVA, for writable descriptor, get HVA through iovec. For used
> >        ring update, translate its GIOVA to HVA
> >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> >        through GPA. Pay attention this reverse mapping is not guaranteed
> >        to be unique, so we should log each possible GPA in this case.
> >
> >     This fix the failure of scp to guest during migration. In -next, we
> >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> >
> >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > All of the above is not what virtio did right now.
>
> Any IOMMU flushes IOTLB on translation changes. If vhost doesn't then
> it's highly likely to be a bug.

It is exactly what vhost did.

>
>
> > > Parav what's the plan for this? Should be addressed in the
> > > spec too.
> > >
> >
> > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> >
> > >
> > >
> > > > >
> > > > > > 2) There are a lot of special or reserved IOVA ranges (for example the
> > > > > > interrupt areas in x86) that need special care which is architectural
> > > > > > and where it is beyond the scope or knowledge of the virtio device but
> > > > > > the platform IOMMU. Things would be more complicated when SVA is
> > > > > > enabled.
> > > > >
> > > > > SVA being what here?
> > > >
> > > > For example, IOMMU may treat interrupt ranges differently depending on
> > > > whether SVA is enabled or not. It's very hard and unnecessary to teach
> > > > devices about this.
> > >
> > > Oh, shared virtual memory. So what you are saying here? virtio
> > > does not care, it just uses some addresses and if you want it to
> > > it can record writes somewhere.
> >
> > One example, PCI allows devices to send translated requests, how can a
> > hypervisor know it's a PA or IOVA in this case? We probably need a new
> > bit. But it's not the only thing we need to deal with.
>
> virtio must always log PA.

How? Without ATS, the device can't see PA since it can only use
untranslated requests ...

>
>
> > By definition, interrupt ranges and other reserved ranges should not
> > belong to dirty pages. And the logging should be done before the DMA
> > where there's no way for the device to know whether or not an IOVA is
> > valid or not. It would be more safe to just not report them from the
> > source instead of leaving it to the hypervisor to deal with but this
> > seems impossible at the device level. Otherwise the hypervisor driver
> > needs to communicate with the (v)IOMMU to be reached with the
> > interrupt(MSI) area, RMRR area etc in order to do the correct things
> > or it might have security implications. And those areas don't make
> > sense at L1 when vSVA is enabled. What's more, when vIOMMU could be
> > fully offloaded, there's no easy way to fetch that information.
> >
> > Again, it's hard to bypass or even duplicate the functionality of the
> > platform or we need to step into every single detail of a specific
> > transport, architecture or IOMMU to figure out whether or not logging
> > at virtio is correct which is awkward and unrealistic. This proposal
> > suffers from an exact similar issue when inventing things like
> > freeze/stop where I've pointed out other branches of issues as well.
>
>
> Exactly it's a mess.  Instead of making everything 10x more complex,
> let's just keep talking about PA and leave translation to IOMMU.

For many reasons, the device can't see PA.

Even with PA, it's still problematic: is it GPA or HPA? GPA may only
work if the device is abstracted behind two-dimensional I/O page tables
like the IOMMU. For HPA, we can't just report it to userspace, which
would require a software translation again. What's more, as stated above,
there's no way for the device to know whether a PA is valid (unless
there's ATS); logging an invalid PA is dangerous and may have security
implications.
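
To illustrate the ambiguity (the names and offsets below are made up, not
taken from any real IOMMU driver): with nested translation there are two
stages, and a device without ATS only ever emits the stage-1 input address,
so an IOVA is the only thing it can log:

#include <stdint.h>
#include <stdio.h>

/* Illustrative stubs: stage 1 is the guest-controlled I/O page table
 * (IOVA -> GPA), stage 2 is the host-controlled table (GPA -> HPA). */
static uint64_t stage1_lookup(uint64_t iova) { return iova + 0x10000000ULL; }
static uint64_t stage2_lookup(uint64_t gpa)  { return gpa  + 0x40000000ULL; }

int main(void)
{
        /* A non-ATS device only sees, and so can only log, the IOVA... */
        uint64_t logged_iova = 0x4000;
        /* ...so the hypervisor must walk the guest-controlled stage-1 table
         * after the fact to recover the GPA it needs for the dirty bitmap;
         * by then the guest may already have remapped the IOVA (the race
         * described earlier in this thread). */
        uint64_t gpa = stage1_lookup(logged_iova);
        uint64_t hpa = stage2_lookup(gpa);
        printf("iova=%#llx gpa=%#llx hpa=%#llx\n",
               (unsigned long long)logged_iova,
               (unsigned long long)gpa,
               (unsigned long long)hpa);
        return 0;
}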

>
>
> > >
> > > > >
> > > > > > And there could be other architecte specific knowledge (e.g
> > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal with
> > > > > > those cases.
> > > > >
> > > > > Good point about page size actually - using 4k unconditionally
> > > > > is a waste of resources.
> > > >
> > > > Actually, they are more than just PAGE_SIZE, for example, PASID and others.
> > >
> > > what does pasid have to do with it? anyway, just give driver control
> > > over page size.
> >
> > For example, two virtqueues have two PASIDs assigned. How can a
> > hypervisor know which specific IOVA belongs to which IOVA? For
> > platform IOMMU, they are handy as it talks to the transport. But I
> > don't think we need to duplicate every transport specific address
> > space feature in core virtio layer:
> >
> > 1) translated/untranslated request
> > 2) request w/ and w/o PASID
>
> Can't say I understand. All the talk about IOVA is just confusing -
> what we care about for logging is which page to resend.

See above.

>
> > > > >
> > > > >
> > > > > > We wouldn't need to care about all of them if it is done at platform
> > > > > > IOMMU level.
> > > > >
> > > > > If someone logs at IOMMU level then nothing needs to be done
> > > > > in the spec at all. This is about capability at the device level.
> > > >
> > > > True, but my question is where or not it can be done at the device level easily.
> > >
> > > there's no "easily" about live migration ever.
> >
> > I think I've stated sufficient issues to demonstrate how hard virtio
> > wants to do it. And I've given the link that it is possible to do that
> > in IOMMU without those issues. So in this context doing it in virtio
> > is much harder.
>
> Code walks though.

There isn't even any code from Parav describing how it can work for a
hypervisor.

>
>
> > > For example on-device iommus are a thing.
> >
> > I'm not sure that's the way to go considering the platform IOMMU
> > evolves very quickly.
>
> What do you refer to? People buy hardware and use it for years
> with no chance to add features.

The IOMMU evolves quickly; duplicating its functionality looks like
reinventing the wheel.

Again, I think we don't want to suffer through the hard times of
bypassing the platform IOMMU again, like in the past.

>
>
> > >
> > > > >
> > > > >
> > > > > > > what Lingshan
> > > > > > > proposed is analogous to bit per page - problem unfortunately is
> > > > > > > you can't easily set a bit by DMA.
> > > > > > >
> > > > > >
> > > > > > I'm not saying bit/bytemap is the best, but it has been used by real
> > > > > > hardware. And we have many other options.
> > > > > >
> > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > >
> > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > Because users needs to use it now.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > >
> > > > > > > > > > > > 4) if the platform support is missing, we can use software or
> > > > > > > > > > > > leverage transport for assistance like PRI
> > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > > > > > > > > done by the cpu.
> > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > >
> > > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > > Do you have perf data for this?
> > > > > > > >
> > > > > > > > No, but it's not hard to imagine the worst case. Wrote a small program
> > > > > > > > that dirty every page by a NIC.
> > > > > > > >
> > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > >
> > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > >
> > > > > > > > So if we get very high dirty rates (e.g by a high speed NIC), we can't
> > > > > > > > satisfy the requirement of the downtime. Or if you see the converge,
> > > > > > > > you might get help from the auto converge support by the hypervisors
> > > > > > > > like KVM where it tries to throttle the VCPU then you can't reach the
> > > > > > > > wire speed.
> > > > > > >
> > > > > > > Will only work for some device types.
> > > > > > >
> > > > > >
> > > > > > Yes, that's the point. Parav said he doesn't see the issue, it's
> > > > > > probably because he is testing a virtio-net and so the vCPU is
> > > > > > automatically throttled. It doesn't mean it can work for other virito
> > > > > > devices.
> > > > >
> > > > > Only for TX, and I'm pretty sure they had the foresight to test RX not
> > > > > just TX but let's confirm. Parav did you test both directions?
> > > >
> > > > RX speed somehow depends on the speed of refill, so throttling helps
> > > > more or less.
> > >
> > > It doesn't depend on speed of refill you just underrun and drop
> > > packets. then your nice 10usec latency becomes more like 10sec.
> >
> > I miss your point here. If the driver can't achieve wire speed without
> > dirty page tracking, it can neither when dirty page tracking is
> > enabled.
>
> My point is PRI causes rx ring underruns and throttling CPU makes it
> worse not better. And I believe people actually tried, nvidia
> have a pri implementation in hardware. If they come and say
> virtio help is needed for performance I tend to believe them.

I'm not saying I don't trust NV. It's not about trust at all. I'm
saying: if they fail with PRI,

1) if there's any fault in virtio that damages the performance of PRI,
let's fix it in virtio
2) if it's not the fault of virtio in the context of PRI, it doesn't
necessarily mean logging via virtio is the only way to go; we can seek
support from other mechanisms that fit better

Unfortunately, they didn't explain why they chose to do it in virtio
until I pointed out the issues.

>
>
>
> > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > So it is unusable.
> > > > > > > > > >
> > > > > > > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > > > > > > > slow, PCI can evolve for sure.
> > > > > > > > > You should try.
> > > > > > > >
> > > > > > > > Not my duty, I just want to make sure things are done in the correct
> > > > > > > > layer, and once it needs to be done in the virtio, there's nothing
> > > > > > > > obviously wrong.
> > > > > > >
> > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > >
> > > > > > I don't think it's vague, I have explained, if something in the virito
> > > > > > slows down the PRI, we can try to fix them.
> > > > >
> > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > >
> > > > So it's the fault of PRI not virito, but it doesn't mean we need to do
> > > > it in virtio.
> > >
> > > I keep saying with this approach we would just say "e1000 emulation is
> > > slow and encumbered this is the fault of e1000" and never get virtio at
> > > all.  Assigning blame only gets you so far.
> >
> > I think we are discussing different things. My point is virtio needs
> > to leverage the functionality provided by transport or platform
> > (especially considering they evolve faster than virtio). It seems to
> > me it's hard even to duplicate some basic function of platform IOMMU
> > in virtio.
>
> Dirty tracking in the IOMMU is annoying enough that I am not

What issues did you see? We can report them to the platform vendors anyhow.

> sure it's usable. Go ahead but I want to see patches then.

If we agree to log via the IOMMU, what kind of patches would you expect to see?

>
> > >
> > > > >
> > > > > > Missing functions in
> > > > > > platform or transport is not a good excuse to try to workaround it in
> > > > > > the virtio. It's a layer violation and we never had any feature like
> > > > > > this in the past.
> > > > >
> > > > > Yes missing functionality in the platform is exactly why virtio
> > > > > was born in the first place.
> > > >
> > > > Well the platform can't do device specific logic. But that's not the
> > > > case of dirty page tracking which is device logic agnostic.
> > >
> > > Not true platforms have things like NICs on board and have for many
> > > years. It's about performance really.
> >
> > I've stated sufficient issues above. And one more obvious issue for
> > device initiated page logging is that it needs a lot of extra or
> > unnecessary PCI transactions which will throttle the performance of
> > the whole system (and lead to other issues like QOS).
>
> Maybe. This kind of statement is just vague enough not to be falsifiable.

I don't think so. It would be falsifiable if some vendor came with
real numbers:

1) demonstrate the possibility of converging a migration while virtio
is running at wire speed
2) demonstrate that logging dirty pages in one VF doesn't damage the
performance of the others

with reasonable explanations. It's not hard to test the above two simple cases.

>
> > So I can't
> > believe it has good performance overall. Logging via IOMMU or using
> > shadow virtqueue doesn't need any extra PCI transactions at least.
>
> On the other hand they have an extra CPU cost.

This is the way current vhost works. We know the pros/cons. And there
are many ways to limit the bandwidth/QoS impact of software-based
dirty tracking.

> Personally if this is
> coming from a hardware vendor, I am inclined to trust them wrt PCI
> transactions.

The point is not about trust. I think Parav has said in another thread
that RX performance is throttled by the dirty tracking.

> But anyway, discussing this at a high level theoretically
> is pointless -

As a reviewer, the most important thing for me is to make sure the
proposal is theoretically correct before I can go through the details.

> whoever bothers with actual prototyping for performance
> testing wins,

This part I don't understand.

LingShan has given you the proof that Intel did this several years ago,
and shadow virtqueue is inspired by that work as well. LingShan's
proposal is based on that experience, and that's why it does not come
with dirty page tracking.

My understanding is that, being an open device standard, the spec needs
to seek the best way to go instead of just one possible way. We never
claim "we are the first, so let's go with my way".

> if no one does I'd expect a back of a napkin estimate
> to be included.

I'd expect any huge feature like this to be prototyped before it can be
discussed, or else to be tagged as RFC.

Thanks






>
>
>
> > > So I'd like Parav to publish some
> > > experiment results and/or some estimates.
> > >
> >
> > That's fine, but the above equation (used by Qemu) is sufficient to
> > demonstrate how hard to stick wire speed in the case.
> >
> > >
> > > > >
> > > > > > >
> > > > > > > > > In the current state, it is mandating.
> > > > > > > > > And if you think PRI is the only way,
> > > > > > > >
> > > > > > > > I don't, it's just an example where virtio can leverage from either
> > > > > > > > transport or platform. Or if it's the fault in virtio that slows down
> > > > > > > > the PRI, then it is something we can do.
> > > > > > > >
> > > > > > > > >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > >
> > > > > > > > No, the point is to not duplicate works especially considering virtio
> > > > > > > > can't do better than platform or transport.
> > > > > > >
> > > > > > > If someone says they tried and platform's migration support does not
> > > > > > > work for them and they want to build a solution in virtio then
> > > > > > > what exactly is the objection?
> > > > > >
> > > > > > The discussion is to make sure whether virtio can do this easily and
> > > > > > correctly, then we can have a conclusion. I've stated some issues
> > > > > > above, and I've asked other questions related to them which are still
> > > > > > not answered.
> > > > > >
> > > > > > I think we had a very hard time in bypassing IOMMU in the past that we
> > > > > > don't want to repeat.
> > > > > >
> > > > > > We've gone through several methods of logging dirty pages in the past
> > > > > > (each with pros/cons), but this proposal never explains why it chooses
> > > > > > one of them but not others. Spec needs to find the best path instead
> > > > > > of just a possible path without any rationale about why.
> > > > >
> > > > > Adding more rationale isn't a bad thing.
> > > > > In particular if platform supplies dirty tracking then how does
> > > > > driver decide which to use platform or device capability?
> > > > > A bit of discussion around this is a good idea.
> > > > >
> > > > >
> > > > > > > virtio is here in the
> > > > > > > first place because emulating devices didn't work well.
> > > > > >
> > > > > > I don't understand here. We have supported emulated devices for years.
> > > > > > I'm pretty sure a lot of issues could be uncovered if this proposal
> > > > > > can be prototyped with an emulated device first.
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > virtio was originally PV as opposed to emulation. That there's now
> > > > > hardware virtio and you call software implementation "an emulation" is
> > > > > very meta.
> > > >
> > > > Yes but I don't see how it relates to dirty page tracking. When we
> > > > find a way it should work for both software and hardware devices.
> > > >
> > > > Thanks
> > >
> > > It has to work well on a variety of existing platforms. If it does then
> > > sure, why would we roll our own.
> >
> > If virtio can do that in an efficient way without any issues, I agree.
> > But it seems not.
> >
> > Thanks
>
>
>
> >
> >
> >
> >
> >
> >
> > >
> > > --
> > > MST
> > >
>





* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-14  7:34                                   ` Zhu, Lingshan
@ 2023-11-14  7:59                                     ` Jason Wang
  2023-11-14  8:27                                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-14  7:59 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Michael S. Tsirkin, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Nov 14, 2023 at 3:34 PM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>
>
>
> On 11/13/2023 2:57 PM, Michael S. Tsirkin wrote:
> > On Mon, Nov 13, 2023 at 11:31:37AM +0800, Jason Wang wrote:
> >> On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> >>>> On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>> On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> >>>>>> On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>>> On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> >>>>>>>>>>> Each virtio and non virtio devices who wants to report their dirty page report,
> >>>>>>>>>> will do their way.
> >>>>>>>>>>>> 3) inventing it in the virtio layer will be deprecated in the future
> >>>>>>>>>>>> for sure, as platform will provide much rich features for logging
> >>>>>>>>>>>> e.g it can do it per PASID etc, I don't see any reason virtio need
> >>>>>>>>>>>> to compete with the features that will be provided by the platform
> >>>>>>>>>>> Can you bring the cpu vendors and committement to virtio tc with timelines
> >>>>>>>>>> so that virtio TC can omit?
> >>>>>>>>>>

[...]

> > On the other hand they have an extra CPU cost.  Personally if this is
> > coming from a hardware vendor, I am inclined to trust them wrt PCI
> > transactions.  But anyway, discussing this at a high level theoretically
> > is pointless - whoever bothers with actual prototyping for performance
> > testing wins, if no one does I'd expect a back of a napkin estimate
> > to be included.
> If so, Intel has released products implementing these interfaces years ago;
> see live migration in 4.1. IFCVF vDPA Implementation,
> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> and
>
> But I still believe we are here to try our best to work out an industrial
> spec of better quality that serves a broad interest. This is not a
> competition between companies, and the spec is not a FIFO; it is not as
> if the early bird catches all the worms.


This is my understanding as well.

Thanks





* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-14  7:34                                   ` Zhu, Lingshan
  2023-11-14  7:59                                     ` Jason Wang
@ 2023-11-14  8:27                                     ` Michael S. Tsirkin
  2023-11-15  4:05                                       ` Zhu, Lingshan
  1 sibling, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-14  8:27 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
> > > So I can't
> > > believe it has good performance overall. Logging via IOMMU or using
> > > shadow virtqueue doesn't need any extra PCI transactions at least.
> > On the other hand they have an extra CPU cost.  Personally if this is
> > coming from a hardware vendor, I am inclined to trust them wrt PCI
> > transactions.  But anyway, discussing this at a high level theoretically
> > is pointless - whoever bothers with actual prototyping for performance
> > testing wins, if no one does I'd expect a back of a napkin estimate
> > to be included.
> If so, Intel has released products implementing these interfaces years ago;
> see live migration in 4.1. IFCVF vDPA Implementation,
> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> and

That one is based on a shadow queue, right? Which, I think, shows it is
worth supporting.

> But I still believe we are here to try our best to work out an industrial
> spec of better quality that serves a broad interest. This is not a
> competition between companies, and the spec is not a FIFO; it is not as
> if the early bird catches all the worms.





* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-14  7:57                                   ` Jason Wang
@ 2023-11-14  9:16                                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-14  9:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Tue, Nov 14, 2023 at 03:57:01PM +0800, Jason Wang wrote:
> On Mon, Nov 13, 2023 at 2:57 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Nov 13, 2023 at 11:31:37AM +0800, Jason Wang wrote:
> > > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > > Each virtio and non virtio devices who wants to report their dirty page report,
> > > > > > > > > > > will do their way.
> > > > > > > > > > > >
> > > > > > > > > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > > > > > > > > for sure, as platform will provide much rich features for logging
> > > > > > > > > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > > > to compete with the features that will be provided by the platform
> > > > > > > > > > > > Can you bring the cpu vendors and committement to virtio tc with timelines
> > > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > > >
> > > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio needs to be built
> > > > > > > > > > > on top of transport or platform. There's no need to duplicate their job.
> > > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > > >
> > > > > > > > > > I wanted to see a strong commitment for the cpu vendors to support dirty page tracking.
> > > > > > > > >
> > > > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > > > > > > > ARM are all supporting that now.
> > > > > > > > >
> > > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > > >
> > > > > > > > > Let me quote from the above link:
> > > > > > > > >
> > > > > > > > > """
> > > > > > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > > > > > alongside VT-D rev3.x also do support.
> > > > > > > > > """
> > > > > > > > >
> > > > > > > > > > Without such platform commitment, virtio also skipping it would not work.
> > > > > > > > >
> > > > > > > > > Is the above sufficient? I'm a little bit more familiar with vtd, the
> > > > > > > > > hw feature has been there for years.
> > > > > > > >
> > > > > > > >
> > > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > > >
> > > > > > > I think this comment applies to this proposal as well.
> > > > > >
> > > > > > Yes - some systems might be better off with platform tracking.
> > > > > > And I think supporting shadow vq better would be nice too.
> > > > >
> > > > > For shadow vq, did you mean the work that is done by Eugenio?
> > > >
> > > > Yes.
> > >
> > > That's exactly why vDPA starts with shadow virtqueue. We've evaluated
> > > various possible approaches, each of them have their shortcomings and
> > > shadow virtqueue is the only one that doesn't require any additional
> > > hardware features to work in every platform.
> >
> > What I would like to see is effort to switch shadow on/off not keep it
> > on at all times. That's only good enough for a PoC. And to work on top
> > of virtio that will require effort in the spec.
> 
> Well, there're various approaches. If we just care about the shadow vq
> on/off. Virtqueue indexes plus inflight should be sufficient.

I'm not sure what "inflight" is and what "indexes" are but yes, you need
information about buffers that have been made available to device
and have not been consumed yet.

> Talking about the future, since vDPA allows to conditionally trap a
> virtqueue via ASID. I expect virtio can do the same if PASID is
> supported (and there used to be a proposal for this in the past).

I don't know what "trap" means in this sentence.

> >  If I see spec patches
> > that do that I personally would support that.  It needs to be reasonably
> > generic though, a single 16 bit RW number is not going to be enough.
> 
> It's really device specific, vDPA has demonstrated that it's
> sufficient for networking devices.

I think that existing vdpa devices are just silently in-order.
If the device is in-order, and given it's networking so there's no
processing as such - just DMA - then I think the state of the
ring is fully described by the available and used indices in memory.
Maybe I'm missing something obvious.
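To spell out what I mean, here is a minimal sketch (illustrative C, made-up
names, not spec text) of how the in-flight window of an in-order split ring
falls out of the two indices alone:

#include <stdint.h>

/* Illustrative only: for an in-order device the destination can recover
 * the in-flight window from the two ring indices alone. */
struct ring_progress {
	uint16_t avail_idx;   /* next descriptor the driver will offer */
	uint16_t used_idx;    /* next descriptor the device will complete */
};

/* Buffers made available but not yet used; 16-bit wrap-around is intended. */
static inline uint16_t ring_inflight(const struct ring_progress *p)
{
	return (uint16_t)(p->avail_idx - p->used_idx);
}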

> > I
> > think it's likely admin commands is a good interface for this. If it's a
> > hack making vendor specific assumptions, just keep it in vdpa.
> 
> This part I don't understand. Most of the virtqueue states were
> accessed via common_cfg, I don't see the advantages of separating the
> others in other places unless there's a new transport.

A ring has up to 64k buffers available and not used.  I'm not sure how
much info is necessary for each but even with a byte per buffer, and
multiplied by 32k queues we are pushing a gigabyte.  Reading this out
through a register mapped interface from the hypervisor, with an exit
per dword is going to be unreasonably slow.

So you are going to do DMA, and pass some commands back and forth.  Why
not reuse the admin command structure for this? The admin command header
is 16 bytes for the write portion and 8 bytes for the read portion.  And that
is overkill? Saving 24 bytes of DMA on the slow path is worth inventing a
custom format for? Color me unimpressed.
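As a rough sketch of the comparison (illustrative only; the field names below
are hypothetical and do not claim to be the spec's admin command layout), one
DMA'd request/response pair per queue replaces an exit per dword:

#include <stdint.h>

/* Hypothetical "get vq state" request the driver would DMA to the device. */
struct vq_state_get_req {
	uint16_t opcode;           /* e.g. a "get vq state" admin opcode */
	uint16_t group_type;       /* e.g. an SR-IOV member group */
	uint64_t group_member_id;  /* which member device */
	uint16_t vq_index;         /* which queue */
	uint16_t reserved;
};

/* Hypothetical response the device would DMA back. */
struct vq_state_get_resp {
	uint16_t status;
	uint16_t avail_idx;
	uint16_t used_idx;
	uint16_t reserved;
};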

Yes, in order is simpler and you might get away without this.
I am not very excited about a feature so limited, but hey -
make the dependency explicit, we can discuss.

> >
> > > >
> > > > > >
> > > > > > > > Definitely KVM did
> > > > > > > > not scan PTEs. It used pagefaults with bit per page and later as VM size
> > > > > > > > grew switched to PLM.  This interface is analogous to PLM,
> > > > > > >
> > > > > > > I think you meant PML actually. And it doesn't work like PML. To
> > > > > > > behave like PML it needs to
> > > > > > >
> > > > > > > 1) log buffers were organized as a queue with indices
> > > > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> > > > > > > 3) device need to send a notification to the driver if it runs out of the buffer
> > > > > > >
> > > > > > > I don't see any of the above in this proposal. If we do that it would
> > > > > > > be less problematic than what is being proposed here.
> > > > > >
> > > > > > What is common between this and PML is that you get the addresses
> > > > > > directly without scanning megabytes of bitmaps or worse -
> > > > > > hundreds of megabytes of page tables.
> > > > >
> > > > > Yes, it has overhead but this is the method we use for vhost and KVM (earlier).
> > > > >
> > > > > To me the  important advantage of PML is that it uses limited
> > > > > resources on the host which
> > > > >
> > > > > 1) doesn't require resources in the device
> > > > > 2) doesn't scale as the guest memory increases. (but this advantage
> > > > > doesn't exist in neither this nor bitmap)
> > > >
> > > > it seems (2) exists exactly here.
> > >
> > > Actually not, Parav said the device needs to reserve sufficient
> > > resources in another thread.
> > >
> > > >
> > > >
> > > > > >
> > > > > > The data structure is different but I don't see why it is critical.
> > > > > >
> > > > > > I agree that I don't see out of buffers notifications too which implies
> > > > > > device has to maintain something like a bitmap internally.  Which I
> > > > > > guess could be fine but it is not clear to me how large that bitmap has
> > > > > > to be. How does the device know? Needs to be addressed.
> > > > >
> > > > > This is the question I asked Parav in another thread. Using host
> > > > > memory as a queue with notification (like PML) might be much better.
> > > >
> > > > Well if queue is what you want to do you can just do it internally.
> > >
> > > Then it's not the proposal here, Parav has explained it in another
> > > reply, and as explained it lacks a lot of other facilities.
> > >
> > > > Problem of course is that it might overflow and cause things like
> > > > packet drops.
> > >
> > > Exactly like PML. So sticking to wire speed should not be a general
> > > goal in the context of migration. It can be done if the speed of the
> > > migration interface is faster than the virtio device that needs to be
> > > migrated.
> >
> > People buy hardware to improve performance. Apparently there are people
> > who want to build this hardware.
> 
> We are talking about different things. What I'm saying is that
> sticking to wire speed somehow conflicts with the goal of downtime. If
> mgmt/guest doesn't allow to increase the downtime, it's very hard to
> stick the wirespeed during live dirty page tracking. This doesn't
> prevent people from building and using faster hardware, the hardware
> might just run slower when doing live migration. If I was wrong,
> please explain why.

Which wire? Think about it. If your "wire speed" is saturating the pci
link then extra traffic on that link is going to mean you go slower.
This does not immediately mean you can just ignore speed completely
either btw.  Are all devices and all work-loads always saturating pci? I
doubt it.  For example, latency matters for a lot of people. You don't
saturate pci but you don't want your hypervisor to be on the data path.
That's a problem for shadow and for PRI.


> > It is not our role to tell either
> > of the groups "this should not be a general goal".
> 
> Well, the downtime has been well studied and used for years, and I
> describe the assumptions:
> 
> "
> It can be done if the speed of the migration interface is faster than
> the virtio device that needs to be migrated.
> "
> 
> KVM and Qemu have a lot of mechanisms to throttle as well.

Yes, and so? That all exists, if people are satisfied with what exists
we can call it a day and not bother adding stuff to spec.


> >
> >
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > > >
> > > > > > > 1) For many reasons it can neither see nor log via GPA, so this
> > > > > > > requires a traversal of the vIOMMU mapping tables by the hypervisor
> > > > > > > afterwards, it would be expensive and need synchronization with the
> > > > > > > guest modification of the IO page table which looks very hard.
> > > > > >
> > > > > > vIOMMU is fast enough to be used on data path but not fast enough for
> > > > > > dirty tracking?
> > > > >
> > > > > We set up SPTEs or using nesting offloading where the PTEs could be
> > > > > iterated by hardware directly which is fast.
> > > >
> > > > There's a way to have hardware find dirty PTEs for you quickly?
> > >
> > > Scanning PTEs on the host is faster and more secure than scanning
> > > guests, that's what I want to say:
> > >
> > > 1) the guest page could be swapped out but not the host one.
> > > 2) no guest triggerable behavior
> > >
> > > > I don't know how it's done. Do tell.
> > > >
> > > >
> > > > > This is not the case here where software needs to iterate the IO page
> > > > > tables in the guest which could be slow.
> > > > >
> > > > > > Hard to believe.  If true and you want to speed up
> > > > > > vIOMMU then you implement an efficient datastructure for that.
> > > > >
> > > > > Besides the issue of performance, it's also racy, assuming we are logging IOVA.
> > > > >
> > > > > 0) device log IOVA
> > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > 2) guest map IOVA to a new GPA
> > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > >
> > > > > Then we lost the old GPA.
> > > >
> > > > Interesting and a good point.
> > >
> > > Note that PML logs at GPA as it works at L1 of EPT.
> >
> > And that's perfect for migration.
> 
> Right.
> 
> >
> > > > And by the way e.g. vhost has the same
> > > > issue.  You need to flush dirty tracking info when changing the mappings
> > > > somehow.
> > >
> > > It's not,
> > >
> > > 1) memory translation is done by vhost
> > > 2) vhost knows GPA and it doesn't log via IOVA.
> > >
> > > See this for example, and DPDK has similar fixes.
> > >
> > > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > > Author: Jason Wang <jasowang@redhat.com>
> > > Date:   Wed Jan 16 16:54:42 2019 +0800
> > >
> > >     vhost: log dirty page correctly
> > >
> > >     Vhost dirty page logging API is designed to sync through GPA. But we
> > >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > >     lead to missing data after migration.
> > >
> > >     To solve this issue, when logging with device IOTLB enabled, we will:
> > >
> > >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > >        get HVA, for writable descriptor, get HVA through iovec. For used
> > >        ring update, translate its GIOVA to HVA
> > >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> > >        through GPA. Pay attention this reverse mapping is not guaranteed
> > >        to be unique, so we should log each possible GPA in this case.
> > >
> > >     This fix the failure of scp to guest during migration. In -next, we
> > >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > >
> > >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> > >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> > >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > All of the above is not what virtio did right now.
> >
> > Any IOMMU flushes IOTLB on translation changes. If vhost doesn't then
> > it's highly likely to be a bug.
> 
> It is exactly what vhost did.
> 
> >
> >
> > > > Parav what's the plan for this? Should be addressed in the
> > > > spec too.
> > > >
> > >
> > > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> > >
> > > >
> > > >
> > > > > >
> > > > > > > 2) There are a lot of special or reserved IOVA ranges (for example the
> > > > > > > interrupt areas in x86) that need special care which is architectural
> > > > > > > and where it is beyond the scope or knowledge of the virtio device but
> > > > > > > the platform IOMMU. Things would be more complicated when SVA is
> > > > > > > enabled.
> > > > > >
> > > > > > SVA being what here?
> > > > >
> > > > > For example, IOMMU may treat interrupt ranges differently depending on
> > > > > whether SVA is enabled or not. It's very hard and unnecessary to teach
> > > > > devices about this.
> > > >
> > > > Oh, shared virtual memory. So what are you saying here? virtio
> > > > does not care, it just uses some addresses, and if you want it to,
> > > > it can record writes somewhere.
> > >
> > > One example, PCI allows devices to send translated requests, how can a
> > > hypervisor know it's a PA or IOVA in this case? We probably need a new
> > > bit. But it's not the only thing we need to deal with.
> >
> > virtio must always log PA.
> 
> How? Without ATS, the device can't see PA since it can only use
> untranslated requests ...

Please, can we speak in spec terms?
It does not matter that there's some IOMMU somewhere which then
wants to call the addresses on the physical pci link virtual addresses.
Device vendors without an iommu only know one kind of address.
And so the only place where the virtio spec mentions IOVA is in the iommu device part.
The rest of the spec calls whatever is in the ring a "physical address".


> >
> >
> > > By definition, interrupt ranges and other reserved ranges should not
> > > belong to dirty pages. And the logging should be done before the DMA
> > > where there's no way for the device to know whether or not an IOVA is
> > > valid or not. It would be more safe to just not report them from the
> > > source instead of leaving it to the hypervisor to deal with but this
> > > seems impossible at the device level. Otherwise the hypervisor driver
> > > needs to communicate with the (v)IOMMU to be reached with the
> > > interrupt(MSI) area, RMRR area etc in order to do the correct things
> > > or it might have security implications. And those areas don't make
> > > sense at L1 when vSVA is enabled. What's more, when vIOMMU could be
> > > fully offloaded, there's no easy way to fetch that information.
> > >
> > > Again, it's hard to bypass or even duplicate the functionality of the
> > > platform or we need to step into every single detail of a specific
> > > transport, architecture or IOMMU to figure out whether or not logging
> > > at virtio is correct which is awkward and unrealistic. This proposal
> > > suffers from an exact similar issue when inventing things like
> > > freeze/stop where I've pointed out other branches of issues as well.
> >
> >
> > Exactly it's a mess.  Instead of making everything 10x more complex,
> > let's just keep talking about PA and leave translation to IOMMU.
> 
> For many reasons, the device can't see PA.
> 
> Even with PA, it's still problematic, is it GPA or HPA? GPA may only
> work if the device is abstracted as two dimension I/O page tables like
> IOMMU. For HPA, we can't just report it to the userspace which
> requires a software translation again. What's more, as stated above,
> there's no way for the device to know if the PA is valid or not
> (unless there's an ATS), logging an invalid PA is dangerous and may
> have security implications.

/facepalm

virtio only knows one type of address. it calls it "physical address"
for historical reasons. don't program an invalid address into
the device otherwise you will break it and get to keep both pieces.


> >
> >
> > > >
> > > > > >
> > > > > > > > > And there could be other architecture specific knowledge (e.g
> > > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal with
> > > > > > > those cases.
> > > > > >
> > > > > > Good point about page size actually - using 4k unconditionally
> > > > > > is a waste of resources.
> > > > >
> > > > > Actually, they are more than just PAGE_SIZE, for example, PASID and others.
> > > >
> > > > what does pasid have to do with it? anyway, just give driver control
> > > > over page size.
> > >
> > > For example, two virtqueues have two PASIDs assigned. How can a
> > > hypervisor know which specific IOVA belongs to which PASID? For
> > > platform IOMMU, they are handy as it talks to the transport. But I
> > > don't think we need to duplicate every transport specific address
> > > space feature in core virtio layer:
> > >
> > > 1) translated/untranslated request
> > > 2) request w/ and w/o PASID
> >
> > Can't say I understand. All the talk about IOVA is just confusing -
> > what we care about for logging is which page to resend.
> 
> See above.

I still see nothing relevant above.


> >
> > > > > >
> > > > > >
> > > > > > > We wouldn't need to care about all of them if it is done at platform
> > > > > > > IOMMU level.
> > > > > >
> > > > > > If someone logs at IOMMU level then nothing needs to be done
> > > > > > in the spec at all. This is about capability at the device level.
> > > > >
> > > > > True, but my question is whether or not it can be done at the device level easily.
> > > >
> > > > there's no "easily" about live migration ever.
> > >
> > > I think I've stated sufficient issues to demonstrate how hard it is for
> > > virtio to do this. And I've given the link showing that it is possible to do that
> > > in the IOMMU without those issues. So in this context doing it in virtio
> > > is much harder.
> >
> > Code walks though.
> 
> There's even no code work from Parav to describe how it can work for a
> hypervisor.
> 
> >
> >
> > > > For example on-device iommus are a thing.
> > >
> > > I'm not sure that's the way to go considering the platform IOMMU
> > > evolves very quickly.
> >
> > What do you refer to? People buy hardware and use it for years
> > with no chance to add features.
> 
> IOMMU evolves quickly, so duplicating its functionality looks like
> re-inventing the wheel.
> 
> Again, I think we don't want to suffer from the hard times in
> bypassing the platform IOMMU again like in the past.

This is just a weird claim. Platforms historically evolved much slower
than devices.  Which IOMMUs evolve quickly? What is quickly in your
world?

> >
> >
> > > >
> > > > > >
> > > > > >
> > > > > > > > what Lingshan
> > > > > > > > proposed is analogous to bit per page - problem unfortunately is
> > > > > > > > you can't easily set a bit by DMA.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not saying bit/bytemap is the best, but it has been used by real
> > > > > > > hardware. And we have many other options.
> > > > > > >
> > > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > >
> > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > Because users needs to use it now.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > If not, we are better off to offer this, and when/if platform support is, sure,
> > > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > > >
> > > > > > > > > > > > > 4) if the platform support is missing, we can use software or
> > > > > > > > > > > > > leverage transport for assistance like PRI
> > > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > > Our experiment shows PRI performance is 21x slower than page fault rate
> > > > > > > > > > > done by the cpu.
> > > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > >
> > > > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > > > Do you have perf data for this?
> > > > > > > > >
> > > > > > > > > No, but it's not hard to imagine the worst case. Wrote a small program
> > > > > > > > > that dirties every page via a NIC.
> > > > > > > > >
> > > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > > >
> > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > >
> > > > > > > > > So if we get very high dirty rates (e.g by a high speed NIC), we can't
> > > > > > > > > satisfy the requirement of the downtime. Or if you see the converge,
> > > > > > > > > you might get help from the auto converge support by the hypervisors
> > > > > > > > > like KVM where it tries to throttle the VCPU then you can't reach the
> > > > > > > > > wire speed.
> > > > > > > >
> > > > > > > > Will only work for some device types.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, that's the point. Parav said he doesn't see the issue, it's
> > > > > > > probably because he is testing a virtio-net and so the vCPU is
> > > > > > > automatically throttled. It doesn't mean it can work for other virtio
> > > > > > > devices.
> > > > > >
> > > > > > Only for TX, and I'm pretty sure they had the foresight to test RX not
> > > > > > just TX but let's confirm. Parav did you test both directions?
> > > > >
> > > > > RX speed somehow depends on the speed of refill, so throttling helps
> > > > > more or less.
> > > >
> > > > It doesn't depend on speed of refill you just underrun and drop
> > > > packets. then your nice 10usec latency becomes more like 10sec.
> > >
> > > I miss your point here. If the driver can't achieve wire speed without
> > > dirty page tracking, it can neither when dirty page tracking is
> > > enabled.
> >
> > My point is PRI causes rx ring underruns and throttling CPU makes it
> > worse not better. And I believe people actually tried, nvidia
> > have a pri implementation in hardware. If they come and say
> > virtio help is needed for performance I tend to believe them.
> 
> I'm not saying I'm not trusting NV. It's not about trust at all, I'm
> saying: if they fail with PRI,
> 
> 1) if there's any fault in virtio that damages the performance of PRI,
> let's fix it in virtio

PRI is just slow nothing to do with virtio.

> 2) if it's not the fault of virtio in the context of PRI, it doesn't
> necessarily mean logging via virtio is the only way to go, we can seek
> support from others which fit better

I don't know how anyone is going to do anything useful with feedback
like this. Monkey see problem, monkey fix problem.

> Unfortunately, they didn't explain why they chose to do it in virtio
> until I pointed out the issues.

More motivation is always nice to have.

> >
> >
> >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > So it is unusable.
> > > > > > > > > > >
> > > > > > > > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > > > > > > > > slow, PCI can evolve for sure.
> > > > > > > > > > You should try.
> > > > > > > > >
> > > > > > > > > Not my duty, I just want to make sure things are done in the correct
> > > > > > > > > layer, and once it needs to be done in the virtio, there's nothing
> > > > > > > > > obviously wrong.
> > > > > > > >
> > > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > > >
> > > > > > > I don't think it's vague, I have explained: if something in virtio
> > > > > > > slows down PRI, we can try to fix it.
> > > > > >
> > > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > > >
> > > > > So it's the fault of PRI, not virtio, but it doesn't mean we need to do
> > > > > it in virtio.
> > > >
> > > > I keep saying with this approach we would just say "e1000 emulation is
> > > > slow and encumbered this is the fault of e1000" and never get virtio at
> > > > all.  Assigning blame only gets you so far.
> > >
> > > I think we are discussing different things. My point is virtio needs
> > > to leverage the functionality provided by transport or platform
> > > (especially considering they evolve faster than virtio). It seems to
> > > me it's hard even to duplicate some basic function of platform IOMMU
> > > in virtio.
> >
> > Dirty tracking in the IOMMU is annoying enough that I am not
> 
> What issue did you see? We can report them to platform vendors anyhow.

IIUC there's no log. You need to scan all PTEs to test and
clear the dirty bit, which costs CPU time. The issues were discussed
when KVM switched to PML - the reason PML is nice is not, IMHO,
that it stops the VM as you say (that's more of a problem for KVM) -
it is that you don't need to keep rescanning memory.
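To make the cost difference concrete, here is a toy, self-contained sketch
(not any real KVM or IOMMU API; every name below is a stand-in): a scan pass
costs time proportional to all mapped pages even when nothing changed, while
a log pass costs time proportional to the pages actually written.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES (1u << 20)            /* e.g. 4 GiB of 4 KiB pages */

static bool dirty_bit[NR_PAGES];       /* stand-in for D bits in PTEs */
static uint32_t log_ring[4096];        /* stand-in for a PML-style log */
static size_t log_head, log_tail;

static void mark_for_resend(uint32_t pfn) { (void)pfn; /* queue for migration */ }

/* Scan-based harvest: O(NR_PAGES) per pass, even if nothing was written. */
static void harvest_by_scan(void)
{
	for (uint32_t pfn = 0; pfn < NR_PAGES; pfn++) {
		if (dirty_bit[pfn]) {
			dirty_bit[pfn] = false;
			mark_for_resend(pfn);
		}
	}
}

/* Log-based harvest: O(pages actually written) per pass. */
static void harvest_by_log(void)
{
	while (log_tail != log_head)
		mark_for_resend(log_ring[log_tail++ % 4096]);
}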


> > sure it's usable. Go ahead but I want to see patches then.
> 
> If we agree to log via IOMMU what kind of patches did you expect to see?

A patch to iommufd that lets you find out which memory was modified
so you can migrate it.


> >
> > > >
> > > > > >
> > > > > > > Missing functions in
> > > > > > > platform or transport is not a good excuse to try to workaround it in
> > > > > > > the virtio. It's a layer violation and we never had any feature like
> > > > > > > this in the past.
> > > > > >
> > > > > > Yes missing functionality in the platform is exactly why virtio
> > > > > > was born in the first place.
> > > > >
> > > > > Well the platform can't do device specific logic. But that's not the
> > > > > case of dirty page tracking which is device logic agnostic.
> > > >
> > > > Not true platforms have things like NICs on board and have for many
> > > > years. It's about performance really.
> > >
> > > I've stated sufficient issues above. And one more obvious issue for
> > > device initiated page logging is that it needs a lot of extra or
> > > unnecessary PCI transactions which will throttle the performance of
> > > the whole system (and lead to other issues like QOS).
> >
> > Maybe. This kind of statement is just vague enough not to be falsifiable.
> 
> I don't think so. It could be falsifiable if some vendor comes with
> real numbers:
> 
> 1) demonstrate the possibility of converging a migration when virtio
> is running at wire speed
> 2) demonstrate that logging dirty pages in one VF doesn't damage the
> performance of others
> 
> with reasonable explanations. It's not hard to test the above two simple cases.

what does the above have to do with "unnecessary PCI transactions" and
"issues like QOS"?

> >
> > > So I can't
> > > believe it has good performance overall. Logging via IOMMU or using
> > > shadow virtqueue doesn't need any extra PCI transactions at least.
> >
> > On the other hand they have an extra CPU cost.
> 
> This is the way current vhost is working. We know the pros/cons. And
> there are many ways to limit the bandwidth/QOS of a software based
> dirty tracking.

So good. Leave it also, it works. You like how it works, whoever
is satisfied can just use it. Can we move on?


> > Personally if this is
> > coming from a hardware vendor, I am inclined to trust them wrt PCI
> > transactions.
> 
> The point is not about trust. I think Parav has said in another thread
> that RX performance is throttled by the dirty tracking.
> 
> > But anyway, discussing this at a high level theoretically
> > is pointless -
> 
> As a reviewer, the most important thing for me is to make sure the
> proposal is theoretically correct before I can go through the details.
> 
> > whoever bothers with actual prototyping for performance
> > testing wins,
> 
> This part I don't understand.

You just asked for a prototype and performance numbers yourself.


> LingShan has given you the proof that Intel has done it several years
> ago. And shadow virtqueue is inspired by those works as well.
> LingShan's proposal is based on those experiences and that's why
> LingShan's proposal does not come with dirty page tracking.

Fine. So dirty tracking should be optional. Sounds good.  And there
should be some info showing how dirty tracking, if available, brings a
performance benefit.  Sounds even better.

> My understanding is, being an open device standard, the spec needs to
> seek the best way to go instead of just one of the possible ways to
> go. We never claim "we are the first so let's go with my way".
> 
> > if no one does I'd expect a back of a napkin estimate
> > to be included.
> 
> I'd expect any huge feature like this needs to be prototyped before
> they can be discussed or it needs to be tagged as RFC.
> 
> Thanks
> 
> 

I think this was already done. Parav?



> 
> 
> 
> >
> >
> >
> > > > So I'd like Parav to publish some
> > > > experiment results and/or some estimates.
> > > >
> > >
> > > That's fine, but the above equation (used by Qemu) is sufficient to
> > > demonstrate how hard it is to stick to wire speed in this case.
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > In the current state, it is mandating.
> > > > > > > > > > And if you think PRI is the only way,
> > > > > > > > >
> > > > > > > > > I don't, it's just an example where virtio can leverage from either
> > > > > > > > > transport or platform. Or if it's the fault in virtio that slows down
> > > > > > > > > the PRI, then it is something we can do.
> > > > > > > > >
> > > > > > > > > >  than you should propose that in the dirty page tracking series that you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > > >
> > > > > > > > > No, the point is to not duplicate works especially considering virtio
> > > > > > > > > can't do better than platform or transport.
> > > > > > > >
> > > > > > > > If someone says they tried and platform's migration support does not
> > > > > > > > work for them and they want to build a solution in virtio then
> > > > > > > > what exactly is the objection?
> > > > > > >
> > > > > > > The discussion is to make sure whether virtio can do this easily and
> > > > > > > correctly, then we can have a conclusion. I've stated some issues
> > > > > > > above, and I've asked other questions related to them which are still
> > > > > > > not answered.
> > > > > > >
> > > > > > > I think we had a very hard time in bypassing IOMMU in the past that we
> > > > > > > don't want to repeat.
> > > > > > >
> > > > > > > We've gone through several methods of logging dirty pages in the past
> > > > > > > (each with pros/cons), but this proposal never explains why it chooses
> > > > > > > one of them but not others. Spec needs to find the best path instead
> > > > > > > of just a possible path without any rationale about why.
> > > > > >
> > > > > > Adding more rationale isn't a bad thing.
> > > > > > In particular if platform supplies dirty tracking then how does
> > > > > > driver decide which to use platform or device capability?
> > > > > > A bit of discussion around this is a good idea.
> > > > > >
> > > > > >
> > > > > > > > virtio is here in the
> > > > > > > > first place because emulating devices didn't work well.
> > > > > > >
> > > > > > > I don't understand here. We have supported emulated devices for years.
> > > > > > > I'm pretty sure a lot of issues could be uncovered if this proposal
> > > > > > > can be prototyped with an emulated device first.
> > > > > > >
> > > > > > > Thanks
> > > > > >
> > > > > > virtio was originally PV as opposed to emulation. That there's now
> > > > > > hardware virtio and you call software implementation "an emulation" is
> > > > > > very meta.
> > > > >
> > > > > Yes but I don't see how it relates to dirty page tracking. When we
> > > > > find a way it should work for both software and hardware devices.
> > > > >
> > > > > Thanks
> > > >
> > > > It has to work well on a variety of existing platforms. If it does then
> > > > sure, why would we roll our own.
> > >
> > > If virtio can do that in an efficient way without any issues, I agree.
> > > But it seems not.
> > >
> > > Thanks
> >
> >
> >
> > >
> > >
> > >
> > >
> > >
> > >
> > > >
> > > > --
> > > > MST
> > > >
> >


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-14  8:27                                     ` Michael S. Tsirkin
@ 2023-11-15  4:05                                       ` Zhu, Lingshan
  2023-11-15  7:51                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-15  4:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
>>>> So I can't
>>>> believe it has good performance overall. Logging via IOMMU or using
>>>> shadow virtqueue doesn't need any extra PCI transactions at least.
>>> On the other hand they have an extra CPU cost.  Personally if this is
>>> coming from a hardware vendor, I am inclined to trust them wrt PCI
>>> transactions.  But anyway, discussing this at a high level theoretically
>>> is pointless - whoever bothers with actual prototyping for performance
>>> testing wins, if no one does I'd expect a back of a napkin estimate
>>> to be included.
>> if so, Intel has released productions implementing these interfaces years
>> ago,
>> see live migration in 4.1. IFCVF vDPA Implementation,
>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
>> and
> That one is based on shadow queue, right? Which I think this shows
> is worth supporting.
Yes, it is shadow virtqueue, I assume this is already mostly done,
do you see any gaps we need to address in our series that we should work on?

Thanks
>
>> But I still believe we are here try our best to work out an industrial spec
>> with better quality, to serve broad interest. This is not competition
>> between companies,
>> and the spec is not a FIFO, not like a early bird can catch all the worm.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15  4:05                                       ` Zhu, Lingshan
@ 2023-11-15  7:51                                         ` Michael S. Tsirkin
  2023-11-15  7:59                                           ` Zhu, Lingshan
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-15  7:51 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
> > On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
> > > > > So I can't
> > > > > believe it has good performance overall. Logging via IOMMU or using
> > > > > shadow virtqueue doesn't need any extra PCI transactions at least.
> > > > On the other hand they have an extra CPU cost.  Personally if this is
> > > > coming from a hardware vendor, I am inclined to trust them wrt PCI
> > > > transactions.  But anyway, discussing this at a high level theoretically
> > > > is pointless - whoever bothers with actual prototyping for performance
> > > > testing wins, if no one does I'd expect a back of a napkin estimate
> > > > to be included.
> > > if so, Intel has released productions implementing these interfaces years
> > > ago,
> > > see live migration in 4.1. IFCVF vDPA Implementation,
> > > https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> > > and
> > That one is based on shadow queue, right? Which I think this shows
> > is worth supporting.
> Yes, it is shadow virtqueue, I assume this is already mostly done,
> do you see any gaps we need to address in our series that we should work on?
> 
> Thanks

There were a ton of comments posted on your series.

> > 
> > > But I still believe we are here try our best to work out an industrial spec
> > > with better quality, to serve broad interest. This is not competition
> > > between companies,
> > > and the spec is not a FIFO, not like a early bird can catch all the worm.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-09  6:24                     ` Parav Pandit
  2023-11-13  3:37                       ` [virtio-comment] " Jason Wang
@ 2023-11-15  7:58                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-15  7:58 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 09, 2023 at 06:24:56AM +0000, Parav Pandit wrote:
> Once PRI is enabled, even without migration, there are basic perf issues.

So you keep saying this and this gives the impression you have a PoC
that you tested to check performance based on admin commands.
If so that's great because one of the main points here is performance.
Could you share these numbers? And, could you compare to shadow
vq based one which is upstream in qemu?


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15  7:51                                         ` Michael S. Tsirkin
@ 2023-11-15  7:59                                           ` Zhu, Lingshan
  2023-11-15  8:05                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-15  7:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 11/15/2023 3:51 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
>>>>>> So I can't
>>>>>> believe it has good performance overall. Logging via IOMMU or using
>>>>>> shadow virtqueue doesn't need any extra PCI transactions at least.
>>>>> On the other hand they have an extra CPU cost.  Personally if this is
>>>>> coming from a hardware vendor, I am inclined to trust them wrt PCI
>>>>> transactions.  But anyway, discussing this at a high level theoretically
>>>>> is pointless - whoever bothers with actual prototyping for performance
>>>>> testing wins, if no one does I'd expect a back of a napkin estimate
>>>>> to be included.
>>>> if so, Intel has released productions implementing these interfaces years
>>>> ago,
>>>> see live migration in 4.1. IFCVF vDPA Implementation,
>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
>>>> and
>>> That one is based on shadow queue, right? Which I think this shows
>>> is worth supporting.
>> Yes, it is shadow virtqueue, I assume this is already mostly done,
>> do you see any gaps we need to address in our series that we should work on?
>>
>> Thanks
> There were a ton of comments posted on your series.
Hope I didn't miss anything. I see your latest comments are about vq states,
as replied before, I think we can record the states by two le16 and the 
in-flight
descriptor tracking facility.

For this shadow virtqueue, do you think I should address this in my V4?
Like saying: acknowledged control commands through the control virtqueue
should be recorded, and we want to use shadow virtqueue to track dirty 
pages.
>
>>>> But I still believe we are here try our best to work out an industrial spec
>>>> with better quality, to serve broad interest. This is not competition
>>>> between companies,
>>>> and the spec is not a FIFO, not like a early bird can catch all the worm.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-09  6:26                         ` [virtio-comment] " Parav Pandit
@ 2023-11-15  7:59                           ` Michael S. Tsirkin
  2023-11-15 17:42                             ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-15  7:59 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 09, 2023 at 06:26:44AM +0000, Parav Pandit wrote:
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 8, 2023 9:59 AM
> > 
> > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > Each virtio and non-virtio device that wants to report its
> > > > > > > dirty pages
> > > > > > will do it its own way.
> > > > > > >
> > > > > > > > 3) inventing it in the virtio layer will be deprecated in
> > > > > > > > the future for sure, as platform will provide much rich
> > > > > > > > features for logging e.g it can do it per PASID etc, I don't
> > > > > > > > see any reason virtio need to compete with the features that
> > > > > > > > will be provided by the platform
> > > > > > > Can you bring the cpu vendors and their commitment to the virtio tc
> > > > > > > with timelines
> > > > > > so that the virtio TC can omit it?
> > > > > >
> > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio
> > > > > > needs to be built on top of transport or platform. There's no need to
> > duplicate their job.
> > > > > > Especially considering that virtio can't do better than them.
> > > > > >
> > > > > I wanted to see a strong commitment for the cpu vendors to support dirty
> > page tracking.
> > > >
> > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > > ARM are all supporting that now.
> > > >
> > > > > And the work seems to have started for some platforms.
> > > >
> > > > Let me quote from the above link:
> > > >
> > > > """
> > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > alongside VT-D rev3.x also do support.
> > > > """
> > > >
> > > > > Without such platform commitment, virtio also skipping it would not work.
> > > >
> > > > Is the above sufficient? I'm a little bit more familiar with vtd,
> > > > the hw feature has been there for years.
> > >
> > >
> > > Repeating myself - I'm not sure that will work well for all workloads.
> > 
> > I think this comment applies to this proposal as well.
> > 
> > > Definitely KVM did
> > > not scan PTEs. It used pagefaults with bit per page and later as VM
> > > size grew switched to PLM.  This interface is analogous to PLM,
> > 
> > I think you meant PML actually. And it doesn't work like PML. To behave like
> > PML it needs to
> > 
> > 1) log buffers were organized as a queue with indices
> > 2) device needs to suspend (as a #vmexit in PML) if it runs out of the buffers
> > 3) device need to send a notification to the driver if it runs out of the buffer
> > 
> > I don't see any of the above in this proposal. If we do that it would be less
> > problematic than what is being proposed here.
> > 
> In this proposal, it's slightly different from PML.
> The log buffer is a write record kept by the device, which keeps recording into it.
> And the owner driver queries the recorded pages.
> The device can internally do PML or a different implementation as it finds suitable.

I personally like it that this detail is hidden inside the device.
One important functionality that PML has and that this does not
have is the ability to interrupt the host, e.g. if it is running low on
space to record this info. Want to add it in some way?
E.g. a special command that is only used if the device is low
on buffers.
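Purely as an illustration of the kind of thing I am asking for (nothing here
is proposed spec text; all names are made up), the device-to-driver
notification could be as small as:

#include <stdint.h>

/* Hypothetical event a device could raise, e.g. through an admin
 * virtqueue or an interrupt, when its write-record space runs low,
 * so the owner driver can drain the records before anything is lost. */
struct write_records_low_event {
	uint64_t group_member_id;   /* which member device is running low */
	uint32_t records_free;      /* remaining record capacity */
	uint32_t records_total;     /* total record capacity */
};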


> > Even if we manage to do that, it doesn't mean we won't have issues.
> > 
> > 1) For many reasons it can neither see nor log via GPA, so this requires a
> > traversal of the vIOMMU mapping tables by the hypervisor afterwards, it would
> > be expensive and need synchronization with the guest modification of the IO
> > page table which looks very hard.
> > 2) There are a lot of special or reserved IOVA ranges (for example the interrupt
> > areas in x86) that need special care which is architectural and where it is
> > beyond the scope or knowledge of the virtio device but the platform IOMMU.
> > Things would be more complicated when SVA is enabled. And there could be
> > other architecture specific knowledge (e.g
> > PAGE_SIZE) that might be needed. There's no easy way to deal with those cases.
> > 
> 
> Current and future iommufd and OS interfaces can likely support this already.
> In the current proposal, multiple ranges are supplied to the device; the reserved ranges are not part of them.
> 
> > We wouldn't need to care about all of them if it is done at platform IOMMU
> > level.
> > 
> I agree that when the platform IOMMU has support, and if it is better, it should be the hypervisor's first choice.
> Mainly because the D bit of the page is already there, rather than a special PML queue or a racy bitmap like what was proposed in the other series.

BTW your bitmap is also racy if there's a vIOMMU, unless hypervisor is
very careful to empty the bitmap when mappings change.
You should document this requirement.
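A sketch of that requirement in pseudocode (the helpers are hypothetical
stand-ins for the hypervisor's plumbing): the bitmap records writes against
the old IOVA-to-GPA mapping, so it has to be harvested before that mapping
is allowed to change.

#include <stdint.h>

/* Hypothetical stand-ins for the real hypervisor/vIOMMU plumbing. */
static void drain_dirty_bitmap(uint64_t iova, uint64_t len) { (void)iova; (void)len; }
static void apply_guest_unmap(uint64_t iova, uint64_t len) { (void)iova; (void)len; }

static void viommu_handle_unmap(uint64_t iova, uint64_t len)
{
	drain_dirty_bitmap(iova, len);  /* resolve dirty info with the OLD mapping */
	apply_guest_unmap(iova, len);   /* only then let the translation change */
}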


-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15  7:59                                           ` Zhu, Lingshan
@ 2023-11-15  8:05                                             ` Michael S. Tsirkin
  2023-11-15  8:42                                               ` Zhu, Lingshan
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-15  8:05 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 15, 2023 at 03:59:16PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/15/2023 3:51 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
> > > > On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
> > > > > > > So I can't
> > > > > > > believe it has good performance overall. Logging via IOMMU or using
> > > > > > > shadow virtqueue doesn't need any extra PCI transactions at least.
> > > > > > On the other hand they have an extra CPU cost.  Personally if this is
> > > > > > coming from a hardware vendor, I am inclined to trust them wrt PCI
> > > > > > transactions.  But anyway, discussing this at a high level theoretically
> > > > > > is pointless - whoever bothers with actual prototyping for performance
> > > > > > testing wins, if no one does I'd expect a back of a napkin estimate
> > > > > > to be included.
> > > > > if so, Intel has released productions implementing these interfaces years
> > > > > ago,
> > > > > see live migration in 4.1. IFCVF vDPA Implementation,
> > > > > https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> > > > > and
> > > > That one is based on shadow queue, right? Which I think this shows
> > > > is worth supporting.
> > > Yes, it is shadow virtqueue, I assume this is already mostly done,
> > > do you see any gaps we need to address in our series that we should work on?
> > > 
> > > Thanks
> > There were a ton of comments posted on your series.
> Hope I didn't miss anything. I see your latest comments are about vq states,
> as replied before, I think we can record the states by two le16 and the
> in-flight
> descriptor tracking facility.

I don't know why you need the le16. in-flight tracking should be enough.
And given it needs DMA I would try really hard to actually use
admin commands for this. 

> For this shadow virtqueue, do you think I should address this in my V4?
> Like saying: acknowledged control commands through the control virtqueue
> should be recorded, and we want to use shadow virtqueue to track dirty
> pages.

What you need to do is actually describe what you expect the device
to do when it enters this suspend state. Since you mention the control
virtqueue, it seems there needs to be device type
specific text explaining the behaviour. If so, this implies there
needs to be a list of device types that support suspend,
until someone looks at each type and documents what it does.

> > 
> > > > > But I still believe we are here try our best to work out an industrial spec
> > > > > with better quality, to serve broad interest. This is not competition
> > > > > between companies,
> > > > > and the spec is not a FIFO, not like a early bird can catch all the worm.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15  8:05                                             ` Michael S. Tsirkin
@ 2023-11-15  8:42                                               ` Zhu, Lingshan
  2023-11-15 11:52                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-15  8:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 11/15/2023 4:05 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 15, 2023 at 03:59:16PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/15/2023 3:51 PM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
>>>> On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
>>>>>>>> So I can't
>>>>>>>> believe it has good performance overall. Logging via IOMMU or using
>>>>>>>> shadow virtqueue doesn't need any extra PCI transactions at least.
>>>>>>> On the other hand they have an extra CPU cost.  Personally if this is
>>>>>>> coming from a hardware vendor, I am inclined to trust them wrt PCI
>>>>>>> transactions.  But anyway, discussing this at a high level theoretically
>>>>>>> is pointless - whoever bothers with actual prototyping for performance
>>>>>>> testing wins, if no one does I'd expect a back of a napkin estimate
>>>>>>> to be included.
>>>>>> if so, Intel has released productions implementing these interfaces years
>>>>>> ago,
>>>>>> see live migration in 4.1. IFCVF vDPA Implementation,
>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
>>>>>> and
>>>>> That one is based on shadow queue, right? Which I think this shows
>>>>> is worth supporting.
>>>> Yes, it is shadow virtqueue, I assume this is already mostly done,
>>>> do you see any gaps we need to address in our series that we should work on?
>>>>
>>>> Thanks
>>> There were a ton of comments posted on your series.
>> Hope I didn't miss anything. I see your latest comments are about vq states,
>> as replied before, I think we can record the states by two le16 and the
>> in-flight
>> descriptor tracking facility.
> I don't know why you need the le16. in-flight tracking should be enough.
> And given it needs DMA I would try really hard to actually use
> admin commands for this.
we need to record the on-device avail_idx and used_idx; otherwise,
how can the destination side know the device-internal values?
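For concreteness, a minimal sketch (illustrative only, not spec text) of the
two le16 per queue I have in mind, alongside whatever the in-flight
descriptor tracking facility provides:

#include <stdint.h>

/* Illustrative per-virtqueue state to transfer at the suspend point. */
struct vq_saved_indices {
	uint16_t avail_idx;   /* last available index the device has fetched */
	uint16_t used_idx;    /* last used index the device has published */
};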
>
>> For this shadow virtqueue, do you think I should address this in my V4?
>> Like saying: acknowledged control commands through the control virtqueue
>> should be recorded, and we want to use shadow virtqueue to track dirty
>> pages.
> What you need to do is actually describe what you expect the device
> to do when it enters this suspend state. Since you mention the control
> virtqueue, it seems there needs to be device type
> specific text explaining the behaviour. If so, this implies there
> needs to be a list of device types that support suspend,
> until someone looks at each type and documents what it does.
On second thought, shadow vqs are hypervisor behavior, so maybe they should
not be described in this device spec.

Since SUSPEND is in the device status, for now I expect every device type that
implements device_status should support SUSPEND. This should be a general facility.
>
>>>>>> But I still believe we are here try our best to work out an industrial spec
>>>>>> with better quality, to serve broad interest. This is not competition
>>>>>> between companies,
>>>>>> and the spec is not a FIFO, not like a early bird can catch all the worm.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15  8:42                                               ` Zhu, Lingshan
@ 2023-11-15 11:52                                                 ` Michael S. Tsirkin
  2023-11-16  9:38                                                   ` Zhu, Lingshan
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-15 11:52 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 15, 2023 at 04:42:56PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/15/2023 4:05 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 15, 2023 at 03:59:16PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/15/2023 3:51 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
> > > > > On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
> > > > > > On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
> > > > > > > > > So I can't
> > > > > > > > > believe it has good performance overall. Logging via IOMMU or using
> > > > > > > > > shadow virtqueue doesn't need any extra PCI transactions at least.
> > > > > > > > On the other hand they have an extra CPU cost.  Personally if this is
> > > > > > > > coming from a hardware vendor, I am inclined to trust them wrt PCI
> > > > > > > > transactions.  But anyway, discussing this at a high level theoretically
> > > > > > > > is pointless - whoever bothers with actual prototyping for performance
> > > > > > > > testing wins, if no one does I'd expect a back of a napkin estimate
> > > > > > > > to be included.
> > > > > > > if so, Intel has released productions implementing these interfaces years
> > > > > > > ago,
> > > > > > > see live migration in 4.1. IFCVF vDPA Implementation,
> > > > > > > https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> > > > > > > and
> > > > > > That one is based on shadow queue, right? Which I think this shows
> > > > > > is worth supporting.
> > > > > Yes, it is shadow virtqueue, I assume this is already mostly done,
> > > > > do you see any gaps we need to address in our series that we should work on?
> > > > > 
> > > > > Thanks
> > > > There were a ton of comments posted on your series.
> > > Hope I didn't miss anything. I see your latest comments are about vq states,
> > > as replied before, I think we can record the states by two le16 and the
> > > in-flight
> > > descriptor tracking facility.
> > I don't know why you need the le16. in-flight tracking should be enough.
> > And given it needs DMA I would try really hard to actually use
> > admin commands for this.
> we need to record the on-device avail_idx and used_idx, or
> how can the destination side know the device internal values.

Again you never documented what state the device is in so I can't really
say for sure.  But generally whenever a buffer is used the internal
values are written out to memory.

> > 
> > > For this shadow virtqueue, do you think I should address this in my V4?
> > > Like saying: acknowledged control commands through the control virtqueue
> > > should be recorded, and we want to use shadow virtqueue to track dirty
> > > pages.
> > What you need to do is actually describe what do you expect the device
> > to do when it enters this suspend state. since you mention control
> > virtqueue then it seems that there needs to be a device type
> > specific text explaining the behaviour. If so this implies there
> > needs to be a list of device types that support suspend
> > until someone looks at each type and documents what it does.
> On a second thought, shadow vqs are hypervisor behaviors, maybe should not
> be
> described in this device spec.
> 
> Since SUSPEND is in device status, so for now I see every type of device
> implements
> device_status should support SUSPEND. This should be a general facility.





^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-13  3:41                                 ` [virtio-comment] " Jason Wang
  2023-11-13 14:30                                   ` Michael S. Tsirkin
@ 2023-11-15 17:37                                   ` Parav Pandit
  2023-11-16  4:24                                     ` [virtio-comment] " Jason Wang
  2023-11-16  6:50                                     ` Michael S. Tsirkin
  1 sibling, 2 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-15 17:37 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 13, 2023 9:11 AM
> 
> On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > Hi Michael,
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 9, 2023 1:29 PM
> >
> > [..]
> > > > Besides the issue of performance, it's also racy, assuming we are
> > > > logging
> > > IOVA.
> > > >
> > > > 0) device log IOVA
> > > > 1) hypervisor fetches IOVA from log buffer
> > > > 2) guest map IOVA to a new GPA
> > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > >
> > > > Then we lost the old GPA.
> > >
> > > Interesting and a good point. And by the way e.g. vhost has the same
> > > issue.  You need to flush dirty tracking info when changing the
> > > mappings somehow.  Parav what's the plan for this? Should be addressed in
> the spec too.
> > >
> > As you listed the flush is needed for vhost or device-based DPT.
> 
> What does DPT mean? Device Page Table? Let's not invent terminology which is
> not known by others please.
>
Sorry for using the acronym. I meant dirty page tracking.
 
> We have discussed it many times. You can't just depend on ATS or reinventing
> wheels in virtio.
The dependency is on the IOMMU, which would have the GIOVA-to-GPA mapping like any software implementation.
There is no dependency on ATS.

> 
> What's more, please try not to give me the impression that the proposal is
> optimized for a specific vendor (like device IOMMU stuffs).
>
Please stop calling this a specific-vendor thing.
One could equally say that the suspend bit proposal is for the software-vendor device, forcing virtio hardware devices to implement only I/O queues + PASID + a non-unified interface for PFs, VFs and SIOVs + non-TDISP based devices.
 
> > The necessary plumbing is already covered for this in the query (read and
> clear) command of this v3 proposal.
> 
> The issue is logging via IOVA ... I don't see how "read and clear" can help.
> 
Read-and-clear ensures that all the dirty pages are reported, hence there is no mapping/unmapping race,
as everything is reported before the mapping changes.

> > It is listed in Device Write Records Read Command.
> 
> Please explain how your proposal can solve the above race.
> 
In the following manner (see the sketch after this list):
1. The guest has a GIOVA to GPA_1 mapping.
2. RX packets are written to GIOVA.
3. The device reports a dirty page log entry for GIOVA (the hypervisor is yet to read it).
4. The guest requests a mapping change from GIOVA to GPA_2.
4.1 During this change, the IOTLB is invalidated and the dirty page report is queried, ensuring the mapping can be changed safely.
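
A minimal sketch of the hypervisor-side path implied by step 4.1, assuming
hypothetical helper names for the admin command wrappers:

    /* Hypothetical unmap path: before the GIOVA translation is changed,
     * read and clear the device's write records for that range so that no
     * page dirtied against the old GPA_1 is lost. */
    static int unmap_giova_range(struct member_dev *dev, u64 giova, u64 len)
    {
            int ret;

            /* 1. Read-and-clear write records covering [giova, giova + len);
             *    every reported address is marked dirty against GPA_1. */
            ret = read_and_clear_write_records(dev, giova, len, mark_dirty_cb);
            if (ret)
                    return ret;

            /* 2. Only now invalidate the IOTLB and drop the GIOVA -> GPA_1
             *    translation; the new GPA_2 mapping can then be installed. */
            return iommu_unmap_and_invalidate(dev->domain, giova, len);
    }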

> >
> > When the page write record is fully read, it is flushed.
> > How/when to use, I think its hypervisor specific, so we probably better off not
> documenting those details.
> 
> Well, as the author of this proposal, at least you need to know how a hypervisor
> can work with your proposal, no?
>
Likely yes, but it is not in the scope of the spec to list those paths.

> > May be such read is needed in some other path too depending on how
> hypervisor implemented.
> 
> What do you mean by "May be ... some other path" here? You're inventing a
> mechanism that you don't know how a hypervisor can use?

No. I meant the hypervisor may have more operations than map/unmap/flush where it may need to implement this.
Someone may call it set_map(), someone else may call it dma_map()...
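
Purely as an illustration of what those hypervisor-specific paths could look
like, a sketch with assumed names:

    /* Hypothetical mapping ops: whatever the stack calls them (set_map(),
     * dma_map(), ...), each path that changes a translation would first
     * sync the device's write records for the affected range. */
    struct dirty_aware_map_ops {
            int (*map)(void *domain, u64 iova, u64 pa, u64 len);
            int (*unmap)(void *domain, u64 iova, u64 len);
            /* invoked before map/unmap to read and clear device write records */
            int (*sync_write_records)(void *dev, u64 iova, u64 len);
    };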

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-13  3:37                       ` [virtio-comment] " Jason Wang
@ 2023-11-15 17:38                         ` Parav Pandit
  2023-11-16  4:23                           ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-15 17:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 13, 2023 9:07 AM
> 
> On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 7, 2023 9:34 AM
> > >
> > > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, November 6, 2023 12:04 PM
> > > > >
> > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > > >
> > > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > > >
> > > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > During a device migration flow (typically in a
> > > > > > > > > > > > precopy phase of the live migration), a device may
> > > > > > > > > > > > write to the guest memory. Some iommu/hypervisor
> > > > > > > > > > > > may not be able to track these
> > > > > > > written pages.
> > > > > > > > > > > > These pages to be migrated from source to
> > > > > > > > > > > > destination
> > > hypervisor.
> > > > > > > > > > > >
> > > > > > > > > > > > A device which writes to these pages, provides the
> > > > > > > > > > > > page address record of the to the owner device.
> > > > > > > > > > > > The owner device starts write recording for the
> > > > > > > > > > > > device and queries all the page addresses written by the
> device.
> > > > > > > > > > > >
> > > > > > > > > > > > Fixes:
> > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/issues/17
> > > > > > > > > > > > 6
> > > > > > > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > > > > > > Signed-off-by: Satananda Burla
> > > > > > > > > > > > <sburla@marvell.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > > changelog:
> > > > > > > > > > > > v1->v2:
> > > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > > ---
> > > > > > > > > > > >  admin-cmds-device-migration.tex | 15
> > > > > > > > > > > > +++++++++++++++
> > > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > > > > > > b/admin-cmds-device-migration.tex index
> > > > > > > > > > > > ed911e4..2e32f2c
> > > > > > > > > > > > 100644
> > > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > > > > Migration}\label{sec:Basic Facilities of a Virtio
> > > > > > > > > > > > Device / The owner driver can discard any
> > > > > > > > > > > > partially read or written device context when  any
> > > > > > > > > > > > of the device migration flow
> > > > > > > > > > > should be aborted.
> > > > > > > > > > > >
> > > > > > > > > > > > +During the device migration flow, a passthrough
> > > > > > > > > > > > +device may write data to the guest virtual
> > > > > > > > > > > > +machine's memory, a source hypervisor needs to
> > > > > > > > > > > > +keep track of these written memory to migrate
> > > > > > > > > > > > +such memory to destination
> > > > > > > > > > > hypervisor.
> > > > > > > > > > > > +Some systems may not be able to keep track of
> > > > > > > > > > > > +such memory write addresses at hypervisor level.
> > > > > > > > > > > > +In such a scenario, a device records and reports
> > > > > > > > > > > > +these written memory addresses to the owner
> > > > > > > > > > > > +device. The owner driver enables write recording
> > > > > > > > > > > > +for one or more physical address ranges per
> > > > > > > > > > > > +device during device
> > > > > > > migration flow.
> > > > > > > > > > > > +The owner driver periodically queries these
> > > > > > > > > > > > +written physical address
> > > > > > > > > records from the device.
> > > > > > > > > > >
> > > > > > > > > > > I wonder how PA works in this case. Device uses
> > > > > > > > > > > untranslated requests so it can only see IOVA. We
> > > > > > > > > > > can't mandate
> > > ATS anyhow.
> > > > > > > > > > Michael suggested to keep the language uniform as PA
> > > > > > > > > > as this is ultimately
> > > > > > > > > what the guest driver is supplying during vq creation
> > > > > > > > > and in posting buffers as physical address.
> > > > > > > > >
> > > > > > > > > This seems to need some work. And, can you show me how
> > > > > > > > > it can
> > > work?
> > > > > > > > >
> > > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor expected to
> > > > > > > > > do a bisection of the whole range?
> > > > > > > > > 2) does the device need to reserve sufficient internal
> > > > > > > > > resources for logging the dirty page and why (not)?
> > > > > > > > No when dirty page logging starts, only at that time,
> > > > > > > > device will reserve
> > > > > > > enough resources.
> > > > > > >
> > > > > > > GAW is 48bit, how large would it have then?
> > > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > > It is function of address ranges for the amount of guest
> > > > > > memory regardless of
> > > > > GAW.
> > > > >
> > > > > The problem is, e.g when vIOMMU is enabled, you can't know which
> > > > > IOVA is actually used by guests. And even for the case when
> > > > > vIOMMU is not enabled, the guest may have several TBs. Is it
> > > > > easy to reserve sufficient resources by the device itself?
> > > > >
> > > > When page tracking is enabled per device, it knows about the range
> > > > and it can
> > > reserve certain resource.
> > >
> > > I didn't see such an interface in this series. Anything I miss?
> > >
> > Yes, this patch and the next patch is covering the page tracking start,stop and
> query commands.
> > They are named as write recording commands.
> 
> So I still don't see how the device can reserve sufficient resources?
> Guests may map a very large area of memory to IOMMU (or when vIOMMU is
> disabled, GPA is used). It would be several TBs, how can the device reserve
> sufficient resources in this case? 
When the map is established, the ranges are supplied to the device so it knows how much to reserve.
If the device does not have enough resources, it fails the command.

One can take this further and provision for the desired range.
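
As a rough sketch only, the start command carrying those ranges could look
like the following; the layout and field names are assumptions, not the
structures defined in this series:

    /* Hypothetical write-recording start command data: the driver supplies
     * the address ranges to track; the device reserves tracking resources
     * for them and fails the command if it cannot. */
    struct virtio_admin_cmd_write_records_start_data {
            le32 num_ranges;
            le32 page_size;              /* tracking granularity in bytes */
            struct {
                    le64 start_addr;     /* start of range to track */
                    le64 length;         /* length of range in bytes */
            } range[];
    };

If the device cannot reserve enough resources for the union of the supplied
ranges, it fails this command up front rather than silently dropping records
later.
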
> 
> >
> > > Btw, the IOVA is allocated by the guest actually, how can we know the
> range?
> > > (or using the host range?)
> > >
> > Hypervisor would have mapping translation.
> 
> That's really tricky and can only work in some cases:
> 
> 1) It requires the hypervisor to traverse the guest I/O page tables which could
> be very large range
> 2) It requests the hypervisor to trap the modification of guest I/O page tables
> and synchronize with the range changes, which is inefficient and can only be
> done when we are doing shadow PTEs. It won't work when the nesting
> translation could be offloaded to the hardware
> 3) It is racy with the guest modification of I/O page tables which is explained in
> another thread
Mapping changes with hardware MMUs are not a frequent event, and the IOTLB flush is done by querying the dirty log for the smaller affected range.

> 4) No aware of new features like PASID which has been explained in another
> thread
For all the pinned mappings with a non-software-based IOMMU, it is typically a small subset.
PASID is guest controlled.

> 
> >
> > > >
> > > > > Host should always have more resources than device, in that
> > > > > sense there could be several methods that tries to utilize host
> > > > > memory instead of the one in the device. I think we've discussed
> > > > > this when going through the doc prepared by Eugenio.
> > > > >
> > > > > >
> > > > > > > What happens if we're trying to migrate more than 1 device?
> > > > > > >
> > > > > > That is perfectly fine.
> > > > > > Each device is updating its log of pages it wrote.
> > > > > > The hypervisor is collecting their sum.
> > > > >
> > > > > See above.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > 3) DMA is part of the transport, it's natural to do
> > > > > > > > > logging there, why duplicate efforts in the virtio layer?
> > > > > > > > He he, you have funny comment.
> > > > > > > > When an abstract facility is added to virtio you say to do in
> transport.
> > > > > > >
> > > > > > > So it's not done in the general facility but tied to the admin part.
> > > > > > > And we all know dirty page tracking is a challenge and
> > > > > > > Eugenio has a good summary of pros/cons. A revisit of those
> > > > > > > docs make me think virtio is not the good place for doing that for
> may reasons:
> > > > > > >
> > > > > > > 1) as stated, platform will evolve to be able to tracking
> > > > > > > dirty pages, actually, it has been supported by a lot of
> > > > > > > major IOMMU vendors
> > > > > >
> > > > > > This is optional facility in virtio.
> > > > > > Can you please point to the references? I don’t see it in the
> > > > > > common Linux
> > > > > kernel support for it.
> > > > >
> > > > > Note that when IOMMUFD is being proposed, dirty page tracking is
> > > > > one of the major considerations.
> > > > >
> > > > > This is one recent proposal:
> > > > >
> > > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > > >
> > > > Sure, so if platform supports it. it can be used from the platform.
> > > > If it does not, the device supplies it.
> > > >
> > > > > > Instead Linux kernel choose to extend to the devices.
> > > > >
> > > > > Well, as I stated, tracking dirty pages is challenging if you
> > > > > want to do it on a device, and you can't simply invent dirty
> > > > > page tracking for each type of the devices.
> > > > >
> > > > It is not invented.
> > > > It is generic framework for all virtio device types as proposed here.
> > > > Keep in mind, that it is optional already in v3 series.
> > > >
> > > > > > At least not seen to arrive this in any near term in start of
> > > > > > 2024 which is
> > > > > where users must use this.
> > > > > >
> > > > > > > 2) you can't assume virtio is the only device that can be
> > > > > > > used by the guest, having dirty pages tracking to be
> > > > > > > implemented in each type of device is unrealistic
> > > > > > Of course, there is no such assumption made. Where did you see
> > > > > > a text that
> > > > > made such assumption?
> > > > >
> > > > > So what happens if you have a guest with virtio and other devices
> assigned?
> > > > >
> > > > What happens? Each device type would do its own dirty page tracking.
> > > > And if all devices does not have support, hypervisor knows to fall
> > > > back to
> > > platform iommu or its own.
> > > >
> > > > > > Each virtio and non virtio devices who wants to report their
> > > > > > dirty page report,
> > > > > will do their way.
> > > > > >
> > > > > > > 3) inventing it in the virtio layer will be deprecated in
> > > > > > > the future for sure, as platform will provide much rich
> > > > > > > features for logging e.g it can do it per PASID etc, I don't
> > > > > > > see any reason virtio need to compete with the features that
> > > > > > > will be provided by the platform
> > > > > > Can you bring the cpu vendors and committement to virtio tc
> > > > > > with timelines
> > > > > so that virtio TC can omit?
> > > > >
> > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio
> > > > > needs to be built on top of transport or platform. There's no
> > > > > need to duplicate
> > > their job.
> > > > > Especially considering that virtio can't do better than them.
> > > > >
> > > > I wanted to see a strong commitment for the cpu vendors to support
> > > > dirty
> > > page tracking.
> > >
> > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > ARM are all supporting that now.
> > >
> > > > And the work seems to have started for some platforms.
> > >
> > > Let me quote from the above link:
> > >
> > > """
> > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > alongside VT-D rev3.x also do support.
> > > """
> > >
> > > > Without such platform commitment, virtio also skipping it would not work.
> > >
> > > Is the above sufficient? I'm a little bit more familiar with vtd,
> > > the hw feature has been there for years.
> > >
> > Vtd has a sticky D bit that requires synchronization with IOPTE page caches
> when sw wants to clear it.
> 
> This is by design.
> 
> > Do you know if is it reliable when device does multiple writes, ie,
> >
> > a. iommu write D bit
> > b. software read it
> > c. sw synchronize cache
> > d. iommu write D bit on next write by device
> 
> What issue did you see here? But that's not even an excuse, if there's a bug,
> let's report it to IOMMU vendors and let them fix it. The thread I point to you is
> actually a good space.
> 
So we cannot claim that it is there in the platform.

> Again, the point is to let the correct role play.
>
How many more years should we block virtio device migration when platforms do not have it?
 
> >
> > ARM SMMU based servers to be present with D bit tracking.
> > It is still early to say platform is ready.
> 
> This is not what I read from both the series I posted and the spec, dirty bit has
> been supported several years ago at least for vtd.
Supported, but the spec lists it as a sticky bit that may require special handling.
Maybe it is working, but not all CPU platforms have it.

> 
> >
> > It is optional so whichever has the support it will be used.
> 
> I can't see the point of this, it is already available. And migration doesn't exist in
> virtio spec yet.
> 
> >
> > > >
> > > > > > i.e. in first year of 2024?
> > > > >
> > > > > Why does it matter in 2024?
> > > > Because users needs to use it now.
> > > >
> > > > >
> > > > > > If not, we are better off to offer this, and when/if platform
> > > > > > support is, sure,
> > > > > this feature can be disabled/not used/not enabled.
> > > > > >
> > > > > > > 4) if the platform support is missing, we can use software
> > > > > > > or leverage transport for assistance like PRI
> > > > > > All of these are in theory.
> > > > > > Our experiment shows PRI performance is 21x slower than page
> > > > > > fault rate
> > > > > done by the cpu.
> > > > > > It simply does not even pass a simple 10Gbps test.
> > > > >
> > > > > If you stick to the wire speed during migration, it can converge.
> > > > Do you have perf data for this?
> > >
> > > No, but it's not hard to imagine the worst case. Wrote a small
> > > program that dirty every page by a NIC.
> > >
> > > > In the internal tests we don’t see this happening.
> > >
> > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > >
> > > So if we get very high dirty rates (e.g by a high speed NIC), we
> > > can't satisfy the requirement of the downtime. Or if you see the
> > > converge, you might get help from the auto converge support by the
> > > hypervisors like KVM where it tries to throttle the VCPU then you can't reach
> the wire speed.
> > >
> > Once PRI is enabled, even without migration, there is basic perf issues.
> 
> The context is not PRI here...
> 
> It's about if you can stick to wire speed during live migration. Based on the
> analysis so far, you can't achieve wirespeed and downtime at the same time.
> That's why the hypervisor needs to throttle VCPU or devices.
>
So?
The device may also throttle itself.

> For PRI, it really depends on how you want to use it. E.g if you don't want to pin
> a page, the performance is the price you must pay.
PRI without pinning does not make sense for a device making large mapping queries.

> 
> >
> > > >
> > > > >
> > > > > > There is no requirement for mandating PRI either.
> > > > > > So it is unusable.
> > > > >
> > > > > It's not about mandating, it's about doing things in the correct
> > > > > layer. If PRI is slow, PCI can evolve for sure.
> > > > You should try.
> > >
> > > Not my duty, I just want to make sure things are done in the correct
> > > layer, and once it needs to be done in the virtio, there's nothing obviously
> wrong.
> > >
> > At present, it looks all platforms are not equally ready for page tracking.
> 
> That's not an excuse to let virtio support that. 
It is a wrong attribution to call it an excuse.

> And we need also to figure out if
> virtio can do that easily. I've pointed out sufficient issues, I'm pretty sure there
> would be more as the platform evolves.
>
I am not sure if virtio feeds the log into the platform.

> >
> > > > In the current state, it is mandating.
> > > > And if you think PRI is the only way,
> > >
> > > I don't, it's just an example where virtio can leverage from either
> > > transport or platform. Or if it's the fault in virtio that slows
> > > down the PRI, then it is something we can do.
> > >
> > Yea, it does not seem to be ready yet.
> >
> > > >  than you should propose that in the dirty page tracking series
> > > > that you listed
> > > above to not do dirty page tracking. Rather depend on PRI, right?
> > >
> > > No, the point is to not duplicate works especially considering
> > > virtio can't do better than platform or transport.
> > >
> > Both the platform and virtio work is ongoing.
> 
> Why duplicate the work then?
>
Not all CPU platforms support it, as far as I know.
 
> >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > When one does something in transport, you say, this is
> > > > > > > > transport specific, do
> > > > > > > some generic.
> > > > > > > >
> > > > > > > > Here the device is being tracked is virtio device.
> > > > > > > > PCI-SIG has told already that PCIM interface is outside the scope of
> it.
> > > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > > >
> > > > > > > You will end up with a competition with the
> > > > > > > platform/transport one that will fail.
> > > > > > >
> > > > > > I don’t see a reason. There is no competition.
> > > > > > Platform always have a choice to not use device side page
> > > > > > tracking when it is
> > > > > supported.
> > > > >
> > > > > Platform provides a lot of other functionalities for dirty logging:
> > > > > e.g per PASID, granular, etc. So you want to duplicate them
> > > > > again in the virtio? If not, why choose this way?
> > > > >
> > > > It is optional for the platforms where platform do not have it.
> > >
> > > We are developing new virtio functionalities that are targeted for
> > > future platforms. Otherwise we would end up with a feature with a
> > > very narrow use case.
> > In general I agree that platform is an option too.
> > Hypervisor will be able to make the decision to use platform when available
> and fallback to device method when platform does not have it.
> >
> > Future and to be equally usable in near term :)
> 
> Please don't double standard again:
> 
> When you are talking about TDISP, you want virtio to be designed to fit for the
> future where the platform is ready in the future When you are talking about
> dirty tracking, you want it to work now even if
> 
The transport VQ proposal is anti-TDISP.
The dirty tracking proposal is not anti-platform; it is optional, like the rest of the platform features.

> 1) most of the platform is ready now
Can you list an ARM server CPU in production that has it (not just in some PDF spec)?

> 2) whether or not virtio can log dirty page correctly is still suspicious
> 
> Thanks

There is no double standard. The feature is optional and coexists with platform support, as explained above.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-13  3:31                               ` Jason Wang
  2023-11-13  6:57                                 ` Michael S. Tsirkin
@ 2023-11-15 17:42                                 ` Parav Pandit
  2023-11-16  4:18                                   ` [virtio-comment] " Jason Wang
  2023-11-17 10:15                                   ` [virtio-comment] " Michael S. Tsirkin
  1 sibling, 2 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-15 17:42 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: virtio-comment, cohuck, sburla, Shahaf Shuler, Maor Gottlieb,
	Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 13, 2023 9:02 AM
> 
> On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > > >
> > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > > > > >
> > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > Each virtio and non virtio devices who wants to report
> > > > > > > > > > their dirty page report,
> > > > > > > > > will do their way.
> > > > > > > > > >
> > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > deprecated in the future for sure, as platform will
> > > > > > > > > > > provide much rich features for logging e.g it can do
> > > > > > > > > > > it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > to compete with the features that will be provided
> > > > > > > > > > > by the platform
> > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > virtio tc with timelines
> > > > > > > > > so that virtio TC can omit?
> > > > > > > > >
> > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > Virtio needs to be built on top of transport or platform. There's
> no need to duplicate their job.
> > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > >
> > > > > > > > I wanted to see a strong commitment for the cpu vendors to
> support dirty page tracking.
> > > > > > >
> > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel,
> > > > > > > AMD and ARM are all supporting that now.
> > > > > > >
> > > > > > > > And the work seems to have started for some platforms.
> > > > > > >
> > > > > > > Let me quote from the above link:
> > > > > > >
> > > > > > > """
> > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > """
> > > > > > >
> > > > > > > > Without such platform commitment, virtio also skipping it would
> not work.
> > > > > > >
> > > > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > > > vtd, the hw feature has been there for years.
> > > > > >
> > > > > >
> > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > >
> > > > > I think this comment applies to this proposal as well.
> > > >
> > > > Yes - some systems might be better off with platform tracking.
> > > > And I think supporting shadow vq better would be nice too.
> > >
> > > For shadow vq, did you mean the work that is done by Eugenio?
> >
> > Yes.
> 
> That's exactly why vDPA starts with shadow virtqueue. We've evaluated various
> possible approaches, each of them have their shortcomings and shadow
> virtqueue is the only one that doesn't require any additional hardware features
> to work in every platform.
> 
> >
> > > >
> > > > > > Definitely KVM did
> > > > > > not scan PTEs. It used pagefaults with bit per page and later
> > > > > > as VM size grew switched to PLM.  This interface is analogous
> > > > > > to PLM,
> > > > >
> > > > > I think you meant PML actually. And it doesn't work like PML. To
> > > > > behave like PML it needs to
> > > > >
> > > > > 1) log buffers were organized as a queue with indices
> > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs out
> > > > > of the buffers
> > > > > 3) device need to send a notification to the driver if it runs
> > > > > out of the buffer
> > > > >
> > > > > I don't see any of the above in this proposal. If we do that it
> > > > > would be less problematic than what is being proposed here.
> > > >
> > > > What is common between this and PML is that you get the addresses
> > > > directly without scanning megabytes of bitmaps or worse - hundreds
> > > > of megabytes of page tables.
> > >
> > > Yes, it has overhead but this is the method we use for vhost and KVM
> (earlier).
> > >
> > > To me the  important advantage of PML is that it uses limited
> > > resources on the host which
> > >
> > > 1) doesn't require resources in the device
> > > 2) doesn't scale as the guest memory increases. (but this advantage
> > > doesn't exist in neither this nor bitmap)
> >
> > it seems 2 exactly exists here.
> 
> Actually not, Parav said the device needs to reserve sufficient resources in
> another thread.
The device resource reservation starts only when the device migration starts,
i.e., with the WRITE_RECORDS_START command of patch 7 in this series.

> 
> >
> >
> > > >
> > > > The data structure is different but I don't see why it is critical.
> > > >
> > > > I agree that I don't see out of buffers notifications too which
> > > > implies device has to maintain something like a bitmap internally.
> > > > Which I guess could be fine but it is not clear to me how large
> > > > that bitmap has to be. How does the device know? Needs to be addressed.
> > >
> > > This is the question I asked Parav in another thread. Using host
> > > memory as a queue with notification (like PML) might be much better.
> >
> > Well if queue is what you want to do you can just do it internally.
> 
> Then it's not the proposal here, Parav has explained it in another reply, and as
> explained it lacks a lot of other facilities.
> 
PML is yet another option, but it requires small PCI writes.
In the current proposal, there are no small PCI writes;
it is a query interface from the device.
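
To make the contrast concrete, a sketch of the hypervisor-initiated query
loop, with assumed structure and function names:

    /* Hypothetical precopy iteration: the hypervisor pulls recorded page
     * addresses from the owner device via admin commands; the member
     * device never initiates PCI writes to a host-resident log. */
    #define REC_BATCH 256

    struct write_record {
            u64 addr;
            u64 len;
    };

    static void sync_device_dirty_pages(struct owner_dev *owner, u32 member_id,
                                        struct dirty_bitmap *bmap)
    {
            struct write_record recs[REC_BATCH];
            int i, n;

            do {
                    /* read (and clear) one batch of recorded addresses */
                    n = read_write_records(owner, member_id, recs, REC_BATCH);
                    for (i = 0; i < n; i++)
                            dirty_bitmap_set(bmap, recs[i].addr, recs[i].len);
            } while (n == REC_BATCH);
    }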

> > Problem of course is that it might overflow and cause things like
> > packet drops.
> 
> Exactly like PML. So sticking to wire speed should not be a general goal in the
> context of migration. It can be done if the speed of the migration interface is
> faster than the virtio device that needs to be migrated.
It may not have to be.
The speed of page recording should be fast enough,
and it usually improves with each subsequent generation.
> 
> >
> >
> > > >
> > > >
> > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > >
> > > > > 1) For many reasons it can neither see nor log via GPA, so this
> > > > > requires a traversal of the vIOMMU mapping tables by the
> > > > > hypervisor afterwards, it would be expensive and need
> > > > > synchronization with the guest modification of the IO page table which
> looks very hard.
> > > >
> > > > vIOMMU is fast enough to be used on data path but not fast enough
> > > > for dirty tracking?
> > >
> > > We set up SPTEs or using nesting offloading where the PTEs could be
> > > iterated by hardware directly which is fast.
> >
> > There's a way to have hardware find dirty PTEs for you quickly?
> 
> Scanning PTEs on the host is faster and more secure than scanning guests, that's
> what I want to say:
> 
> 1) the guest page could be swapped out but not the host one.
> 2) no guest triggerable behavior
> 

The device page tracking records are to be consulted (queried) before flushing on a mapping change.

> > I don't know how it's done. Do tell.
> >
> >
> > > This is not the case here where software needs to iterate the IO
> > > page tables in the guest which could be slow.
> > >
> > > > Hard to believe.  If true and you want to speed up vIOMMU then you
> > > > implement an efficient datastructure for that.
> > >
> > > Besides the issue of performance, it's also racy, assuming we are logging
> IOVA.
> > >
> > > 0) device log IOVA
> > > 1) hypervisor fetches IOVA from log buffer
> > > 2) guest map IOVA to a new GPA
> > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > >
> > > Then we lost the old GPA.
> >
> > Interesting and a good point.
> 
> Note that PML logs at GPA as it works at L1 of EPT.
> 
> > And by the way e.g. vhost has the same issue.  You need to flush dirty
> > tracking info when changing the mappings somehow.
> 
> It's not,
> 
> 1) memory translation is done by vhost
> 2) vhost knows GPA and it doesn't log via IOVA.
> 
> See this for example, and DPDK has similar fixes.
> 
> commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Wed Jan 16 16:54:42 2019 +0800
> 
>     vhost: log dirty page correctly
> 
>     Vhost dirty page logging API is designed to sync through GPA. But we
>     try to log GIOVA when device IOTLB is enabled. This is wrong and may
>     lead to missing data after migration.
> 
>     To solve this issue, when logging with device IOTLB enabled, we will:
> 
>     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
>        get HVA, for writable descriptor, get HVA through iovec. For used
>        ring update, translate its GIOVA to HVA
>     2) traverse the GPA->HVA mapping to get the possible GPA and log
>        through GPA. Pay attention this reverse mapping is not guaranteed
>        to be unique, so we should log each possible GPA in this case.
> 
>     This fix the failure of scp to guest during migration. In -next, we
>     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> 
>     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
>     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
>     Cc: Jintack Lim <jintack@cs.columbia.edu>
>     Signed-off-by: Jason Wang <jasowang@redhat.com>
>     Acked-by: Michael S. Tsirkin <mst@redhat.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> All of the above is not what virtio did right now.
> 
> > Parav what's the plan for this? Should be addressed in the spec too.
> >
> 
> AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> 

The query interface in this proposal works on a granular boundary to read and clear records.
This ensures that the mapping stays consistent.

> >
> >
> > > >
> > > > > 2) There are a lot of special or reserved IOVA ranges (for
> > > > > example the interrupt areas in x86) that need special care which
> > > > > is architectural and where it is beyond the scope or knowledge
> > > > > of the virtio device but the platform IOMMU. Things would be
> > > > > more complicated when SVA is enabled.
> > > >
> > > > SVA being what here?
> > >
> > > For example, IOMMU may treat interrupt ranges differently depending
> > > on whether SVA is enabled or not. It's very hard and unnecessary to
> > > teach devices about this.
> >
> > Oh, shared virtual memory. So what you are saying here? virtio does
> > not care, it just uses some addresses and if you want it to it can
> > record writes somewhere.
> 
> One example, PCI allows devices to send translated requests, how can a
> hypervisor know it's a PA or IOVA in this case? We probably need a new bit. But
> it's not the only thing we need to deal with.
> 
> By definition, interrupt ranges and other reserved ranges should not belong to
> dirty pages. And the logging should be done before the DMA where there's no
> way for the device to know whether or not an IOVA is valid or not. It would be
> more safe to just not report them from the source instead of leaving it to the
> hypervisor to deal with but this seems impossible at the device level. Otherwise
> the hypervisor driver needs to communicate with the (v)IOMMU to be reached
> with the
> interrupt(MSI) area, RMRR area etc in order to do the correct things or it might
> have security implications. And those areas don't make sense at L1 when vSVA
> is enabled. What's more, when vIOMMU could be fully offloaded, there's no
> easy way to fetch that information.
> 
There cannot be logging before the DMA.
The only requirement is that before the mapping changes, the dirty page tracking is synced.

In the most common cases where performance is critical, such mappings won't change that often dynamically anyway.

> Again, it's hard to bypass or even duplicate the functionality of the platform or
> we need to step into every single detail of a specific transport, architecture or
> IOMMU to figure out whether or not logging at virtio is correct which is
> awkward and unrealistic. This proposal suffers from an exact similar issue when
> inventing things like freeze/stop where I've pointed out other branches of issues
> as well.
> 
It is an incorrect attribution that the platform is duplicated here.
The device feeds the data to the platform as needed without replicating it.

I do agree that there is overlap between the IOMMU tracking dirty pages in the per-PTE bits and the device supplying its dirty tracking via its own interface.
Both are consolidated at the hypervisor level.
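
A minimal sketch of that consolidation, with all names assumed:

    /* Hypothetical consolidation: dirty information from the platform
     * IOMMU (per-PTE D bits), where available, and from each device's
     * write records is merged into one bitmap for the migration stream. */
    static void collect_dirty_pages(struct vm *vm, struct dirty_bitmap *bmap)
    {
            struct migratable_dev *dev;

            if (vm->iommu_dirty_tracking_enabled)
                    iommu_read_and_clear_dirty(vm->domain, bmap);

            list_for_each_entry(dev, &vm->devices, node) {
                    if (dev->has_write_records)
                            device_read_and_clear_write_records(dev, bmap);
            }
    }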

> >
> > > >
> > > > > And there could be other architecte specific knowledge (e.g
> > > > > PAGE_SIZE) that might be needed. There's no easy way to deal
> > > > > with those cases.
> > > >
> > > > Good point about page size actually - using 4k unconditionally is
> > > > a waste of resources.
> > >
> > > Actually, they are more than just PAGE_SIZE, for example, PASID and others.
> >
> > what does pasid have to do with it? anyway, just give driver control
> > over page size.
> 
> For example, two virtqueues have two PASIDs assigned. How can a hypervisor
> know which specific IOVA belongs to which IOVA? For platform IOMMU, they
> are handy as it talks to the transport. But I don't think we need to duplicate
> every transport specific address space feature in core virtio layer:
> 
The PASID-to-vq assignment won't be duplicated.
It is configured fully by the guest without consulting the hypervisor at the device level.
The guest IOMMU would consult the hypervisor to set up any PASID mapping as part of whatever mapping method is used.

> 1) translated/untranslated request
> 2) request w/ and w/o PASID
> 
> >
> > > >
> > > >
> > > > > We wouldn't need to care about all of them if it is done at
> > > > > platform IOMMU level.
> > > >
> > > > If someone logs at IOMMU level then nothing needs to be done in
> > > > the spec at all. This is about capability at the device level.
> > >
> > > True, but my question is where or not it can be done at the device level
> easily.
> >
> > there's no "easily" about live migration ever.
> 
> I think I've stated sufficient issues to demonstrate how hard virtio wants to do it.
> And I've given the link that it is possible to do that in IOMMU without those
> issues. So in this context doing it in virtio is much harder.
> 
> > For example on-device iommus are a thing.
> 
> I'm not sure that's the way to go considering the platform IOMMU evolves very
> quickly.
> 
> >
> > > >
> > > >
> > > > > > what Lingshan
> > > > > > proposed is analogous to bit per page - problem unfortunately
> > > > > > is you can't easily set a bit by DMA.
> > > > > >
> > > > >
> > > > > I'm not saying bit/bytemap is the best, but it has been used by
> > > > > real hardware. And we have many other options.
> > > > >
> > > > > > So I think this dirty tracking is a good option to have.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > >
> > > > > > > > > Why does it matter in 2024?
> > > > > > > > Because users needs to use it now.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > If not, we are better off to offer this, and when/if
> > > > > > > > > > platform support is, sure,
> > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > >
> > > > > > > > > > > 4) if the platform support is missing, we can use
> > > > > > > > > > > software or leverage transport for assistance like
> > > > > > > > > > > PRI
> > > > > > > > > > All of these are in theory.
> > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > than page fault rate
> > > > > > > > > done by the cpu.
> > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > >
> > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > Do you have perf data for this?
> > > > > > >
> > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > small program that dirty every page by a NIC.
> > > > > > >
> > > > > > > > In the internal tests we don’t see this happening.
> > > > > > >
> > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > >
> > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > NIC), we can't satisfy the requirement of the downtime. Or
> > > > > > > if you see the converge, you might get help from the auto
> > > > > > > converge support by the hypervisors like KVM where it tries
> > > > > > > to throttle the VCPU then you can't reach the wire speed.
> > > > > >
> > > > > > Will only work for some device types.
> > > > > >
> > > > >
> > > > > Yes, that's the point. Parav said he doesn't see the issue, it's
> > > > > probably because he is testing a virtio-net and so the vCPU is
> > > > > automatically throttled. It doesn't mean it can work for other
> > > > > virito devices.
> > > >
> > > > Only for TX, and I'm pretty sure they had the foresight to test RX
> > > > not just TX but let's confirm. Parav did you test both directions?
> > >
> > > RX speed somehow depends on the speed of refill, so throttling helps
> > > more or less.
> >
> > It doesn't depend on speed of refill you just underrun and drop
> > packets. then your nice 10usec latency becomes more like 10sec.
> 
> I miss your point here. If the driver can't achieve wire speed without dirty page
> tracking, it can neither when dirty page tracking is enabled.
> 
> >
> > > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > So it is unusable.
> > > > > > > > >
> > > > > > > > > It's not about mandating, it's about doing things in the
> > > > > > > > > correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > You should try.
> > > > > > >
> > > > > > > Not my duty, I just want to make sure things are done in the
> > > > > > > correct layer, and once it needs to be done in the virtio,
> > > > > > > there's nothing obviously wrong.
> > > > > >
> > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > >
> > > > > I don't think it's vague, I have explained, if something in the
> > > > > virito slows down the PRI, we can try to fix them.
> > > >
> > > > I don't believe you are going to make PRI fast. No one managed so far.
> > >
> > > So it's the fault of PRI not virito, but it doesn't mean we need to
> > > do it in virtio.
> >
> > I keep saying with this approach we would just say "e1000 emulation is
> > slow and encumbered this is the fault of e1000" and never get virtio
> > at all.  Assigning blame only gets you so far.
> 
> I think we are discussing different things. My point is virtio needs to leverage
> the functionality provided by transport or platform (especially considering they
> evolve faster than virtio). It seems to me it's hard even to duplicate some basic
> function of platform IOMMU in virtio.
> 
Not duplicated; it feeds into the platform.

> >
> > > >
> > > > > Missing functions in
> > > > > platform or transport is not a good excuse to try to workaround
> > > > > it in the virtio. It's a layer violation and we never had any
> > > > > feature like this in the past.
> > > >
> > > > Yes missing functionality in the platform is exactly why virtio
> > > > was born in the first place.
> > >
> > > Well the platform can't do device specific logic. But that's not the
> > > case of dirty page tracking which is device logic agnostic.
> >
> > Not true platforms have things like NICs on board and have for many
> > years. It's about performance really.
> 
> I've stated sufficient issues above. And one more obvious issue for device
> initiated page logging is that it needs a lot of extra or unnecessary PCI
> transactions which will throttle the performance of the whole system (and lead
> to other issues like QOS). So I can't believe it has good performance overall.
> Logging via IOMMU or using shadow virtqueue doesn't need any extra PCI
> transactions at least.
> 
In the current proposal, it does not require PCI transactions, as there is only a hypervisor-initiated query interface.
It is a trade-off between using svq + PASID and using something from the device.

Again, both have different use cases and value. One uses the CPU and one uses the device,
depending on how much power one wants to spend where.

> > So I'd like Parav to publish some
> > experiment results and/or some estimates.
> >
> 
> That's fine, but the above equation (used by Qemu) is sufficient to demonstrate
> how hard to stick wire speed in the case.
> 
> >
> > > >
> > > > > >
> > > > > > > > In the current state, it is mandating.
> > > > > > > > And if you think PRI is the only way,
> > > > > > >
> > > > > > > I don't, it's just an example where virtio can leverage from
> > > > > > > either transport or platform. Or if it's the fault in virtio
> > > > > > > that slows down the PRI, then it is something we can do.
> > > > > > >
> > > > > > > >  than you should propose that in the dirty page tracking series that
> you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > >
> > > > > > > No, the point is to not duplicate works especially
> > > > > > > considering virtio can't do better than platform or transport.
> > > > > >
> > > > > > If someone says they tried and platform's migration support
> > > > > > does not work for them and they want to build a solution in
> > > > > > virtio then what exactly is the objection?
> > > > >
> > > > > The discussion is to make sure whether virtio can do this easily
> > > > > and correctly, then we can have a conclusion. I've stated some
> > > > > issues above, and I've asked other questions related to them
> > > > > which are still not answered.
> > > > >
> > > > > I think we had a very hard time in bypassing IOMMU in the past
> > > > > that we don't want to repeat.
> > > > >
> > > > > We've gone through several methods of logging dirty pages in the
> > > > > past (each with pros/cons), but this proposal never explains why
> > > > > it chooses one of them but not others. Spec needs to find the
> > > > > best path instead of just a possible path without any rationale about
> why.
> > > >
> > > > Adding more rationale isn't a bad thing.
> > > > In particular if platform supplies dirty tracking then how does
> > > > driver decide which to use platform or device capability?
> > > > A bit of discussion around this is a good idea.
> > > >
> > > >
> > > > > > virtio is here in the
> > > > > > first place because emulating devices didn't work well.
> > > > >
> > > > > I don't understand here. We have supported emulated devices for years.
> > > > > I'm pretty sure a lot of issues could be uncovered if this
> > > > > proposal can be prototyped with an emulated device first.
> > > > >
> > > > > Thanks
> > > >
> > > > virtio was originally PV as opposed to emulation. That there's now
> > > > hardware virtio and you call software implementation "an
> > > > emulation" is very meta.
> > >
> > > Yes but I don't see how it relates to dirty page tracking. When we
> > > find a way it should work for both software and hardware devices.
> > >
> > > Thanks
> >
> > It has to work well on a variety of existing platforms. If it does
> > then sure, why would we roll our own.
> 
> If virtio can do that in an efficient way without any issues, I agree.
> But it seems not.
> 
> Thanks

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15  7:59                           ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-15 17:42                             ` Parav Pandit
  0 siblings, 0 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-15 17:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 15, 2023 1:30 PM
> 
> On Thu, Nov 09, 2023 at 06:26:44AM +0000, Parav Pandit wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, November 8, 2023 9:59 AM
> > >
> > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > > >
> > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > Each virtio and non virtio devices who wants to report
> > > > > > > > their dirty page report,
> > > > > > > will do their way.
> > > > > > > >
> > > > > > > > > 3) inventing it in the virtio layer will be deprecated
> > > > > > > > > in the future for sure, as platform will provide much
> > > > > > > > > rich features for logging e.g it can do it per PASID
> > > > > > > > > etc, I don't see any reason virtio need to compete with
> > > > > > > > > the features that will be provided by the platform
> > > > > > > > Can you bring the cpu vendors and committement to virtio
> > > > > > > > tc with timelines
> > > > > > > so that virtio TC can omit?
> > > > > > >
> > > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio
> > > > > > > needs to be built on top of transport or platform. There's
> > > > > > > no need to
> > > duplicate their job.
> > > > > > > Especially considering that virtio can't do better than them.
> > > > > > >
> > > > > > I wanted to see a strong commitment for the cpu vendors to
> > > > > > support dirty
> > > page tracking.
> > > > >
> > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD
> > > > > and ARM are all supporting that now.
> > > > >
> > > > > > And the work seems to have started for some platforms.
> > > > >
> > > > > Let me quote from the above link:
> > > > >
> > > > > """
> > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > alongside VT-D rev3.x also do support.
> > > > > """
> > > > >
> > > > > > Without such platform commitment, virtio also skipping it would not
> work.
> > > > >
> > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > vtd, the hw feature has been there for years.
> > > >
> > > >
> > > > Repeating myself - I'm not sure that will work well for all workloads.
> > >
> > > I think this comment applies to this proposal as well.
> > >
> > > > Definitely KVM did
> > > > not scan PTEs. It used pagefaults with bit per page and later as
> > > > VM size grew switched to PLM.  This interface is analogous to PLM,
> > >
> > > I think you meant PML actually. And it doesn't work like PML. To
> > > behave like PML it needs to
> > >
> > > 1) log buffers were organized as a queue with indices
> > > 2) device needs to suspend (as a #vmexit in PML) if it runs out of
> > > the buffers
> > > 3) device need to send a notification to the driver if it runs out
> > > of the buffer
> > >
> > > I don't see any of the above in this proposal. If we do that it
> > > would be less problematic than what is being proposed here.
> > >
> > In this proposal, its slightly different than PML.
> > The log buffer is a write record with the device. It keeps recording it.
> > And owner driver queries the recorded pages.
> > The device internally can do PML or other different implementations as it
> finds suitable.
> 
> I personally like it that this detail is hidden inside the device.
> One important functionality that PML has and that this does not have is ability
> to interrupt host e.g. if is running low on space to record these info. Want to
> add it in some way?
Page tracking using a PML equivalent can be an additional method.
It can live as an independent feature as well as an extension of this one.

One trade-off to deal with in that approach is that, when an IOTLB flush is needed, the partial range must be queried.
This requires searching the log buffer and creating holes in it.
And the hypervisor needs to do that search and also maintain a shadow copy to overcome this problem.

Using a vq for out-of-order reporting generates too many writes.

In the current device-based query interface there are zero PCI writes, unlike PML.

I would say we should add the PML-like mechanism incrementally once the first round of features is done.
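
To make the intended usage concrete, below is a minimal hypervisor-side
sketch of the query model. It is only an illustration; the helper names
(start_write_records() etc.) are placeholders and not the actual admin
command interface defined in this series.

    /* Illustration only: placeholder helpers, not the actual admin
     * command encodings from this series. */
    #include <stdint.h>

    struct pa_range  { uint64_t start, len; };
    struct write_rec { uint64_t addr, len; };

    int  start_write_records(int member_id, const struct pa_range *r, int n);
    int  query_write_records(int member_id, struct write_rec *recs, int max);
    void stop_write_records(int member_id);
    void mark_page_dirty(uint64_t addr, uint64_t len);
    int  migration_converged(void);

    static int track_member_writes(int member_id,
                                   const struct pa_range *ranges, int n)
    {
        struct write_rec recs[256];
        int rc, nr, i;

        /* Start recording; the device reserves its resources here and
         * may fail, in which case the hypervisor falls back to another
         * method (platform IOMMU tracking, shadow vq, ...). */
        rc = start_write_records(member_id, ranges, n);
        if (rc)
            return rc;

        /* Pre-copy loop: hypervisor-initiated read-and-clear queries.
         * No device-initiated PCI writes are involved in this model. */
        while (!migration_converged()) {
            nr = query_write_records(member_id, recs, 256);
            for (i = 0; i < nr; i++)
                mark_page_dirty(recs[i].addr, recs[i].len);
        }

        stop_write_records(member_id);
        return 0;
    }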

> E.g. a special command that is only used if device is low on buffers.
> 
> 
> > > Even if we manage to do that, it doesn't mean we won't have issues.
> > >
> > > 1) For many reasons it can neither see nor log via GPA, so this
> > > requires a traversal of the vIOMMU mapping tables by the hypervisor
> > > afterwards, it would be expensive and need synchronization with the
> > > guest modification of the IO page table which looks very hard.
> > > 2) There are a lot of special or reserved IOVA ranges (for example
> > > the interrupt areas in x86) that need special care which is
> > > architectural and where it is beyond the scope or knowledge of the virtio
> device but the platform IOMMU.
> > > Things would be more complicated when SVA is enabled. And there
> > > could be other architecte specific knowledge (e.g
> > > PAGE_SIZE) that might be needed. There's no easy way to deal with those
> cases.
> > >
> >
> > Current and future iommufd and OS interface likely can support this already.
> > In current proposal, multiple ranges are supplied to the device, the reserved
> ranges are not part of it.
> >
> > > We wouldn't need to care about all of them if it is done at platform
> > > IOMMU level.
> > >
> > I agree that when platform IOMMU has support and if its better it should be
> first priority to use by the hypervisor.
> > Mainly because the D bit of the page already there, and not a special PML
> queue or a racy bitmap like what was proposed in other series.
> 
> BTW your bitmap is also racy if there's a vIOMMU, unless hypervisor is very
> careful to empty the bitmap when mappings change.
> You should document this requirement.
> 
When to query the dirty page log is the hypervisor's decision; map/unmap, IOTLB flush etc. are hard to document in the spec.
We can write some guiding notes for the hypervisor, but not a requirement.
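
One possible shape of such a guiding note, expressed as pseudo-C (all
helper and type names here are hypothetical, not spec material): before
the hypervisor completes an unmap or remap of an IOVA range, it drains
(reads and clears) the device's write records for that range, so dirty
state recorded against the old translation is not lost.

    /* Sketch only, not spec text; helper names are hypothetical. */
    #include <stdint.h>

    struct write_rec { uint64_t addr, len; };

    int  query_write_records_range(int member_id, uint64_t iova, uint64_t len,
                                   struct write_rec *recs, int max);
    void mark_page_dirty(uint64_t addr, uint64_t len);
    void complete_iotlb_invalidation(uint64_t iova, uint64_t len);

    void viommu_unmap_range(int member_id, uint64_t iova, uint64_t len)
    {
        struct write_rec recs[64];
        int nr, i;

        /* 1. Drain (read and clear) the records for the range being
         *    changed, using the pre-change translation. */
        do {
            nr = query_write_records_range(member_id, iova, len, recs, 64);
            for (i = 0; i < nr; i++)
                mark_page_dirty(recs[i].addr, recs[i].len);
        } while (nr == 64);

        /* 2. Only then complete the IOTLB invalidation / mapping
         *    change requested by the guest. */
        complete_iotlb_invalidation(iova, len);
    }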

> 
> --
> MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15 17:42                                 ` [virtio-comment] " Parav Pandit
@ 2023-11-16  4:18                                   ` Jason Wang
  2023-11-16  5:27                                     ` [virtio-comment] " Parav Pandit
  2023-11-17 10:15                                   ` [virtio-comment] " Michael S. Tsirkin
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-16  4:18 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 1:42 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:02 AM
> >
> > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > > >
> > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > > > > >
> > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > Each virtio and non virtio devices who wants to report
> > > > > > > > > > > their dirty page report,
> > > > > > > > > > will do their way.
> > > > > > > > > > >
> > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > deprecated in the future for sure, as platform will
> > > > > > > > > > > > provide much rich features for logging e.g it can do
> > > > > > > > > > > > it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > > to compete with the features that will be provided
> > > > > > > > > > > > by the platform
> > > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > > virtio tc with timelines
> > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > >
> > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > Virtio needs to be built on top of transport or platform. There's
> > no need to duplicate their job.
> > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > >
> > > > > > > > > I wanted to see a strong commitment for the cpu vendors to
> > support dirty page tracking.
> > > > > > > >
> > > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel,
> > > > > > > > AMD and ARM are all supporting that now.
> > > > > > > >
> > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > >
> > > > > > > > Let me quote from the above link:
> > > > > > > >
> > > > > > > > """
> > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > """
> > > > > > > >
> > > > > > > > > Without such platform commitment, virtio also skipping it would
> > not work.
> > > > > > > >
> > > > > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > > > > vtd, the hw feature has been there for years.
> > > > > > >
> > > > > > >
> > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > >
> > > > > > I think this comment applies to this proposal as well.
> > > > >
> > > > > Yes - some systems might be better off with platform tracking.
> > > > > And I think supporting shadow vq better would be nice too.
> > > >
> > > > For shadow vq, did you mean the work that is done by Eugenio?
> > >
> > > Yes.
> >
> > That's exactly why vDPA starts with shadow virtqueue. We've evaluated various
> > possible approaches, each of them have their shortcomings and shadow
> > virtqueue is the only one that doesn't require any additional hardware features
> > to work in every platform.
> >
> > >
> > > > >
> > > > > > > Definitely KVM did
> > > > > > > not scan PTEs. It used pagefaults with bit per page and later
> > > > > > > as VM size grew switched to PLM.  This interface is analogous
> > > > > > > to PLM,
> > > > > >
> > > > > > I think you meant PML actually. And it doesn't work like PML. To
> > > > > > behave like PML it needs to
> > > > > >
> > > > > > 1) log buffers were organized as a queue with indices
> > > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs out
> > > > > > of the buffers
> > > > > > 3) device need to send a notification to the driver if it runs
> > > > > > out of the buffer
> > > > > >
> > > > > > I don't see any of the above in this proposal. If we do that it
> > > > > > would be less problematic than what is being proposed here.
> > > > >
> > > > > What is common between this and PML is that you get the addresses
> > > > > directly without scanning megabytes of bitmaps or worse - hundreds
> > > > > of megabytes of page tables.
> > > >
> > > > Yes, it has overhead but this is the method we use for vhost and KVM
> > (earlier).
> > > >
> > > > To me the  important advantage of PML is that it uses limited
> > > > resources on the host which
> > > >
> > > > 1) doesn't require resources in the device
> > > > 2) doesn't scale as the guest memory increases. (but this advantage
> > > > doesn't exist in neither this nor bitmap)
> > >
> > > it seems 2 exactly exists here.
> >
> > Actually not, Parav said the device needs to reserve sufficient resources in
> > another thread.
> The device resource reservation starts only when the device migration starts.
> i.e., with the WRITE_RECORDS_START command of patch 7 in the series.

Right, but this is not the question, see below.

>
> >
> > >
> > >
> > > > >
> > > > > The data structure is different but I don't see why it is critical.
> > > > >
> > > > > I agree that I don't see out of buffers notifications too which
> > > > > implies device has to maintain something like a bitmap internally.
> > > > > Which I guess could be fine but it is not clear to me how large
> > > > > that bitmap has to be. How does the device know? Needs to be addressed.
> > > >
> > > > This is the question I asked Parav in another thread. Using host
> > > > memory as a queue with notification (like PML) might be much better.
> > >
> > > Well if queue is what you want to do you can just do it internally.
> >
> > Then it's not the proposal here, Parav has explained it in another reply, and as
> > explained it lacks a lot of other facilities.
> >
> PML is yet another option that requires small PCI writes.
> In the current proposal, there are no small PCI writes.
> It is a query interface from the device.

Well, you've explained in another thread that actually it needs small
PCI writes.

E.g. during IOTLB invalidation ...

>
> > > Problem of course is that it might overflow and cause things like
> > > packet drops.
> >
> > Exactly like PML. So sticking to wire speed should not be a general goal in the
> > context of migration. It can be done if the speed of the migration interface is
> > faster than the virtio device that needs to be migrated.
> May not have to be.
> Speed of page recording should be fast enough.
> It usually improves with subsequent generations.

If you have something better, let's propose it from the start.

> >
> > >
> > >
> > > > >
> > > > >
> > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > >
> > > > > > 1) For many reasons it can neither see nor log via GPA, so this
> > > > > > requires a traversal of the vIOMMU mapping tables by the
> > > > > > hypervisor afterwards, it would be expensive and need
> > > > > > synchronization with the guest modification of the IO page table which
> > looks very hard.
> > > > >
> > > > > vIOMMU is fast enough to be used on data path but not fast enough
> > > > > for dirty tracking?
> > > >
> > > > We set up SPTEs or using nesting offloading where the PTEs could be
> > > > iterated by hardware directly which is fast.
> > >
> > > There's a way to have hardware find dirty PTEs for you quickly?
> >
> > Scanning PTEs on the host is faster and more secure than scanning guests, that's
> > what I want to say:
> >
> > 1) the guest page could be swapped out but not the host one.
> > 2) no guest triggerable behavior
> >
>
> The device page tracking records need to be consulted and flushed on a mapping change.
>
> > > I don't know how it's done. Do tell.
> > >
> > >
> > > > This is not the case here where software needs to iterate the IO
> > > > page tables in the guest which could be slow.
> > > >
> > > > > Hard to believe.  If true and you want to speed up vIOMMU then you
> > > > > implement an efficient datastructure for that.
> > > >
> > > > Besides the issue of performance, it's also racy, assuming we are logging
> > IOVA.
> > > >
> > > > 0) device log IOVA
> > > > 1) hypervisor fetches IOVA from log buffer
> > > > 2) guest map IOVA to a new GPA
> > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > >
> > > > Then we lost the old GPA.
> > >
> > > Interesting and a good point.
> >
> > Note that PML logs at GPA as it works at L1 of EPT.
> >
> > > And by the way e.g. vhost has the same issue.  You need to flush dirty
> > > tracking info when changing the mappings somehow.
> >
> > It's not,
> >
> > 1) memory translation is done by vhost
> > 2) vhost knows GPA and it doesn't log via IOVA.
> >
> > See this for example, and DPDK has similar fixes.
> >
> > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > Author: Jason Wang <jasowang@redhat.com>
> > Date:   Wed Jan 16 16:54:42 2019 +0800
> >
> >     vhost: log dirty page correctly
> >
> >     Vhost dirty page logging API is designed to sync through GPA. But we
> >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> >     lead to missing data after migration.
> >
> >     To solve this issue, when logging with device IOTLB enabled, we will:
> >
> >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> >        get HVA, for writable descriptor, get HVA through iovec. For used
> >        ring update, translate its GIOVA to HVA
> >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> >        through GPA. Pay attention this reverse mapping is not guaranteed
> >        to be unique, so we should log each possible GPA in this case.
> >
> >     This fix the failure of scp to guest during migration. In -next, we
> >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> >
> >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > All of the above is not what virtio did right now.
> >
> > > Parav what's the plan for this? Should be addressed in the spec too.
> > >
> >
> > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> >
>
> The query interface in this proposal works at a granular boundary to read and clear records.
> This ensures that the mapping stays consistent.
>
> > >
> > >
> > > > >
> > > > > > 2) There are a lot of special or reserved IOVA ranges (for
> > > > > > example the interrupt areas in x86) that need special care which
> > > > > > is architectural and where it is beyond the scope or knowledge
> > > > > > of the virtio device but the platform IOMMU. Things would be
> > > > > > more complicated when SVA is enabled.
> > > > >
> > > > > SVA being what here?
> > > >
> > > > For example, IOMMU may treat interrupt ranges differently depending
> > > > on whether SVA is enabled or not. It's very hard and unnecessary to
> > > > teach devices about this.
> > >
> > > Oh, shared virtual memory. So what you are saying here? virtio does
> > > not care, it just uses some addresses and if you want it to it can
> > > record writes somewhere.
> >
> > One example, PCI allows devices to send translated requests, how can a
> > hypervisor know it's a PA or IOVA in this case? We probably need a new bit. But
> > it's not the only thing we need to deal with.
> >
> > By definition, interrupt ranges and other reserved ranges should not belong to
> > dirty pages. And the logging should be done before the DMA where there's no
> > way for the device to know whether or not an IOVA is valid or not. It would be
> > more safe to just not report them from the source instead of leaving it to the
> > hypervisor to deal with but this seems impossible at the device level. Otherwise
> > the hypervisor driver needs to communicate with the (v)IOMMU to be reached
> > with the
> > interrupt(MSI) area, RMRR area etc in order to do the correct things or it might
> > have security implications. And those areas don't make sense at L1 when vSVA
> > is enabled. What's more, when vIOMMU could be fully offloaded, there's no
> > easy way to fetch that information.
> >
> There cannot be logging before the DMA.

Well, I don't see how this is related to the issue above. Logging
after the DMA doesn't mean the device can know what sits behind an
IOVA, no?

> The only requirement is that, before the mapping changes, the dirty page tracking is synced.
>
> In the most common cases where performance is critical, such mappings won't change dynamically very often anyway.

I've explained the issue in another reply.

>
> > Again, it's hard to bypass or even duplicate the functionality of the platform or
> > we need to step into every single detail of a specific transport, architecture or
> > IOMMU to figure out whether or not logging at virtio is correct which is
> > awkward and unrealistic. This proposal suffers from an exact similar issue when
> > inventing things like freeze/stop where I've pointed out other branches of issues
> > as well.
> >
> It is an incorrect attribution that the platform is being duplicated here.
> The device feeds the data to the platform as needed without replicating it.
>
> I do agree that there is overlap between the IOMMU tracking dirty pages and storing them in the per-PTE D bit vs. the device supplying its dirty tracking via its own interface.
> Both are consolidated at the hypervisor level.
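
(For reference, "consolidated at the hypervisor level" could look roughly
like the sketch below: the hypervisor ORs every dirty source it has into
one migration bitmap. Purely illustrative; all helper names are made up
and none of this is spec material.)

    /* Illustration only: merge dirty information from several sources
     * into a single per-guest migration bitmap. 4 KiB pages assumed. */
    #include <stdint.h>

    struct write_rec { uint64_t addr, len; };

    void iommu_read_and_clear_dirty(unsigned long *bitmap, uint64_t npages);
    void kvm_read_and_clear_dirty(unsigned long *bitmap, uint64_t npages);
    int  query_write_records(int member_id, struct write_rec *recs, int max);
    void set_dirty_bits(unsigned long *bitmap, uint64_t first, uint64_t n);

    void sync_dirty_bitmap(unsigned long *bitmap, uint64_t npages, int member_id)
    {
        struct write_rec recs[256];
        int nr, i;

        /* Platform IOMMU D bits, when the platform supports them. */
        iommu_read_and_clear_dirty(bitmap, npages);

        /* Device write records from the owner device (this proposal). */
        while ((nr = query_write_records(member_id, recs, 256)) > 0)
            for (i = 0; i < nr; i++)
                set_dirty_bits(bitmap, recs[i].addr >> 12,
                               (recs[i].len + 4095) >> 12);

        /* CPU-written pages (e.g. the KVM dirty log) are ORed in too. */
        kvm_read_and_clear_dirty(bitmap, npages);
    }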
>
> > >
> > > > >
> > > > > > And there could be other architecte specific knowledge (e.g
> > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal
> > > > > > with those cases.
> > > > >
> > > > > Good point about page size actually - using 4k unconditionally is
> > > > > a waste of resources.
> > > >
> > > > Actually, they are more than just PAGE_SIZE, for example, PASID and others.
> > >
> > > what does pasid have to do with it? anyway, just give driver control
> > > over page size.
> >
> > For example, two virtqueues have two PASIDs assigned. How can a hypervisor
> > know which specific IOVA belongs to which IOVA? For platform IOMMU, they
> > are handy as it talks to the transport. But I don't think we need to duplicate
> > every transport specific address space feature in core virtio layer:
> >
> The PASID-to-vq assignment won't be duplicated.
> It is configured fully by the guest without consulting the hypervisor at the device level.
> The guest IOMMU would consult the hypervisor to set up any PASID mapping as part of whatever mapping method is used.
>
> > 1) translated/untranslated request
> > 2) request w/ and w/o PASID
> >
> > >
> > > > >
> > > > >
> > > > > > We wouldn't need to care about all of them if it is done at
> > > > > > platform IOMMU level.
> > > > >
> > > > > If someone logs at IOMMU level then nothing needs to be done in
> > > > > the spec at all. This is about capability at the device level.
> > > >
> > > > True, but my question is where or not it can be done at the device level
> > easily.
> > >
> > > there's no "easily" about live migration ever.
> >
> > I think I've stated sufficient issues to demonstrate how hard virtio wants to do it.
> > And I've given the link that it is possible to do that in IOMMU without those
> > issues. So in this context doing it in virtio is much harder.
> >
> > > For example on-device iommus are a thing.
> >
> > I'm not sure that's the way to go considering the platform IOMMU evolves very
> > quickly.
> >
> > >
> > > > >
> > > > >
> > > > > > > what Lingshan
> > > > > > > proposed is analogous to bit per page - problem unfortunately
> > > > > > > is you can't easily set a bit by DMA.
> > > > > > >
> > > > > >
> > > > > > I'm not saying bit/bytemap is the best, but it has been used by
> > > > > > real hardware. And we have many other options.
> > > > > >
> > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > >
> > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > Because users needs to use it now.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > If not, we are better off to offer this, and when/if
> > > > > > > > > > > platform support is, sure,
> > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > >
> > > > > > > > > > > > 4) if the platform support is missing, we can use
> > > > > > > > > > > > software or leverage transport for assistance like
> > > > > > > > > > > > PRI
> > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > than page fault rate
> > > > > > > > > > done by the cpu.
> > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > >
> > > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > > Do you have perf data for this?
> > > > > > > >
> > > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > > small program that dirty every page by a NIC.
> > > > > > > >
> > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > >
> > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > >
> > > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > > NIC), we can't satisfy the requirement of the downtime. Or
> > > > > > > > if you see the converge, you might get help from the auto
> > > > > > > > converge support by the hypervisors like KVM where it tries
> > > > > > > > to throttle the VCPU then you can't reach the wire speed.
> > > > > > >
> > > > > > > Will only work for some device types.
> > > > > > >
> > > > > >
> > > > > > Yes, that's the point. Parav said he doesn't see the issue, it's
> > > > > > probably because he is testing a virtio-net and so the vCPU is
> > > > > > automatically throttled. It doesn't mean it can work for other
> > > > > > virito devices.
> > > > >
> > > > > Only for TX, and I'm pretty sure they had the foresight to test RX
> > > > > not just TX but let's confirm. Parav did you test both directions?
> > > >
> > > > RX speed somehow depends on the speed of refill, so throttling helps
> > > > more or less.
> > >
> > > It doesn't depend on speed of refill you just underrun and drop
> > > packets. then your nice 10usec latency becomes more like 10sec.
> >
> > I miss your point here. If the driver can't achieve wire speed without dirty page
> > tracking, it can neither when dirty page tracking is enabled.
> >
> > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > So it is unusable.
> > > > > > > > > >
> > > > > > > > > > It's not about mandating, it's about doing things in the
> > > > > > > > > > correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > You should try.
> > > > > > > >
> > > > > > > > Not my duty, I just want to make sure things are done in the
> > > > > > > > correct layer, and once it needs to be done in the virtio,
> > > > > > > > there's nothing obviously wrong.
> > > > > > >
> > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > >
> > > > > > I don't think it's vague, I have explained, if something in the
> > > > > > virito slows down the PRI, we can try to fix them.
> > > > >
> > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > >
> > > > So it's the fault of PRI not virito, but it doesn't mean we need to
> > > > do it in virtio.
> > >
> > > I keep saying with this approach we would just say "e1000 emulation is
> > > slow and encumbered this is the fault of e1000" and never get virtio
> > > at all.  Assigning blame only gets you so far.
> >
> > I think we are discussing different things. My point is virtio needs to leverage
> > the functionality provided by transport or platform (especially considering they
> > evolve faster than virtio). It seems to me it's hard even to duplicate some basic
> > function of platform IOMMU in virtio.
> >
> Not duplicated. Feeding into the platform.
>
> > >
> > > > >
> > > > > > Missing functions in
> > > > > > platform or transport is not a good excuse to try to workaround
> > > > > > it in the virtio. It's a layer violation and we never had any
> > > > > > feature like this in the past.
> > > > >
> > > > > Yes missing functionality in the platform is exactly why virtio
> > > > > was born in the first place.
> > > >
> > > > Well the platform can't do device specific logic. But that's not the
> > > > case of dirty page tracking which is device logic agnostic.
> > >
> > > Not true platforms have things like NICs on board and have for many
> > > years. It's about performance really.
> >
> > I've stated sufficient issues above. And one more obvious issue for device
> > initiated page logging is that it needs a lot of extra or unnecessary PCI
> > transactions which will throttle the performance of the whole system (and lead
> > to other issues like QOS). So I can't believe it has good performance overall.
> > Logging via IOMMU or using shadow virtqueue doesn't need any extra PCI
> > transactions at least.
> >
> In the current proposal, it does not require PCI transactions, as there is only a hypervisor-initiated query interface.

Such query requires at least several transactions, no?

Or to make things clearer, could you list the steps of how a
hypervisor is expected to do the querying?

> It is a trade-off between using SVQ + PASID vs. using something from the device.
>
> Again, both have different use cases and value. One uses the CPU and one uses the device.

That's for sure.

> Depending on how much power one wants to spend where.

But as the author of the proposal, you need to elaborate more on how
you expect a hypervisor to use this instead of letting the reviewers guess.

Thanks



>
> > > So I'd like Parav to publish some
> > > experiment results and/or some estimates.
> > >
> >
> > That's fine, but the above equation (used by Qemu) is sufficient to demonstrate
> > how hard to stick wire speed in the case.
> >
> > >
> > > > >
> > > > > > >
> > > > > > > > > In the current state, it is mandating.
> > > > > > > > > And if you think PRI is the only way,
> > > > > > > >
> > > > > > > > I don't, it's just an example where virtio can leverage from
> > > > > > > > either transport or platform. Or if it's the fault in virtio
> > > > > > > > that slows down the PRI, then it is something we can do.
> > > > > > > >
> > > > > > > > >  than you should propose that in the dirty page tracking series that
> > you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > >
> > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > >
> > > > > > > If someone says they tried and platform's migration support
> > > > > > > does not work for them and they want to build a solution in
> > > > > > > virtio then what exactly is the objection?
> > > > > >
> > > > > > The discussion is to make sure whether virtio can do this easily
> > > > > > and correctly, then we can have a conclusion. I've stated some
> > > > > > issues above, and I've asked other questions related to them
> > > > > > which are still not answered.
> > > > > >
> > > > > > I think we had a very hard time in bypassing IOMMU in the past
> > > > > > that we don't want to repeat.
> > > > > >
> > > > > > We've gone through several methods of logging dirty pages in the
> > > > > > past (each with pros/cons), but this proposal never explains why
> > > > > > it chooses one of them but not others. Spec needs to find the
> > > > > > best path instead of just a possible path without any rationale about
> > why.
> > > > >
> > > > > Adding more rationale isn't a bad thing.
> > > > > In particular if platform supplies dirty tracking then how does
> > > > > driver decide which to use platform or device capability?
> > > > > A bit of discussion around this is a good idea.
> > > > >
> > > > >
> > > > > > > virtio is here in the
> > > > > > > first place because emulating devices didn't work well.
> > > > > >
> > > > > > I don't understand here. We have supported emulated devices for years.
> > > > > > I'm pretty sure a lot of issues could be uncovered if this
> > > > > > proposal can be prototyped with an emulated device first.
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > virtio was originally PV as opposed to emulation. That there's now
> > > > > hardware virtio and you call software implementation "an
> > > > > emulation" is very meta.
> > > >
> > > > Yes but I don't see how it relates to dirty page tracking. When we
> > > > find a way it should work for both software and hardware devices.
> > > >
> > > > Thanks
> > >
> > > It has to work well on a variety of existing platforms. If it does
> > > then sure, why would we roll our own.
> >
> > If virtio can do that in an efficient way without any issues, I agree.
> > But it seems not.
> >
> > Thanks


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15 17:38                         ` [virtio-comment] " Parav Pandit
@ 2023-11-16  4:23                           ` Jason Wang
  2023-11-16  5:29                             ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-16  4:23 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:07 AM
> >
> > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 7, 2023 9:34 AM
> > > >
> > > > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, November 6, 2023 12:04 PM
> > > > > >
> > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > > > >
> > > > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > During a device migration flow (typically in a
> > > > > > > > > > > > > precopy phase of the live migration), a device may
> > > > > > > > > > > > > write to the guest memory. Some iommu/hypervisor
> > > > > > > > > > > > > may not be able to track these
> > > > > > > > written pages.
> > > > > > > > > > > > > These pages to be migrated from source to
> > > > > > > > > > > > > destination
> > > > hypervisor.
> > > > > > > > > > > > >
> > > > > > > > > > > > > A device which writes to these pages, provides the
> > > > > > > > > > > > > page address record of the to the owner device.
> > > > > > > > > > > > > The owner device starts write recording for the
> > > > > > > > > > > > > device and queries all the page addresses written by the
> > device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Fixes:
> > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/issues/17
> > > > > > > > > > > > > 6
> > > > > > > > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > > > > > > > Signed-off-by: Satananda Burla
> > > > > > > > > > > > > <sburla@marvell.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > changelog:
> > > > > > > > > > > > > v1->v2:
> > > > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  admin-cmds-device-migration.tex | 15
> > > > > > > > > > > > > +++++++++++++++
> > > > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > b/admin-cmds-device-migration.tex index
> > > > > > > > > > > > > ed911e4..2e32f2c
> > > > > > > > > > > > > 100644
> > > > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > > > > > Migration}\label{sec:Basic Facilities of a Virtio
> > > > > > > > > > > > > Device / The owner driver can discard any
> > > > > > > > > > > > > partially read or written device context when  any
> > > > > > > > > > > > > of the device migration flow
> > > > > > > > > > > > should be aborted.
> > > > > > > > > > > > >
> > > > > > > > > > > > > +During the device migration flow, a passthrough
> > > > > > > > > > > > > +device may write data to the guest virtual
> > > > > > > > > > > > > +machine's memory, a source hypervisor needs to
> > > > > > > > > > > > > +keep track of these written memory to migrate
> > > > > > > > > > > > > +such memory to destination
> > > > > > > > > > > > hypervisor.
> > > > > > > > > > > > > +Some systems may not be able to keep track of
> > > > > > > > > > > > > +such memory write addresses at hypervisor level.
> > > > > > > > > > > > > +In such a scenario, a device records and reports
> > > > > > > > > > > > > +these written memory addresses to the owner
> > > > > > > > > > > > > +device. The owner driver enables write recording
> > > > > > > > > > > > > +for one or more physical address ranges per
> > > > > > > > > > > > > +device during device
> > > > > > > > migration flow.
> > > > > > > > > > > > > +The owner driver periodically queries these
> > > > > > > > > > > > > +written physical address
> > > > > > > > > > records from the device.
> > > > > > > > > > > >
> > > > > > > > > > > > I wonder how PA works in this case. Device uses
> > > > > > > > > > > > untranslated requests so it can only see IOVA. We
> > > > > > > > > > > > can't mandate
> > > > ATS anyhow.
> > > > > > > > > > > Michael suggested to keep the language uniform as PA
> > > > > > > > > > > as this is ultimately
> > > > > > > > > > what the guest driver is supplying during vq creation
> > > > > > > > > > and in posting buffers as physical address.
> > > > > > > > > >
> > > > > > > > > > This seems to need some work. And, can you show me how
> > > > > > > > > > it can
> > > > work?
> > > > > > > > > >
> > > > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor expected to
> > > > > > > > > > do a bisection of the whole range?
> > > > > > > > > > 2) does the device need to reserve sufficient internal
> > > > > > > > > > resources for logging the dirty page and why (not)?
> > > > > > > > > No when dirty page logging starts, only at that time,
> > > > > > > > > device will reserve
> > > > > > > > enough resources.
> > > > > > > >
> > > > > > > > GAW is 48bit, how large would it have then?
> > > > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > > > It is function of address ranges for the amount of guest
> > > > > > > memory regardless of
> > > > > > GAW.
> > > > > >
> > > > > > The problem is, e.g when vIOMMU is enabled, you can't know which
> > > > > > IOVA is actually used by guests. And even for the case when
> > > > > > vIOMMU is not enabled, the guest may have several TBs. Is it
> > > > > > easy to reserve sufficient resources by the device itself?
> > > > > >
> > > > > When page tracking is enabled per device, it knows about the range
> > > > > and it can
> > > > reserve certain resource.
> > > >
> > > > I didn't see such an interface in this series. Anything I miss?
> > > >
> > > Yes, this patch and the next patch is covering the page tracking start,stop and
> > query commands.
> > > They are named as write recording commands.
> >
> > So I still don't see how the device can reserve sufficient resources?
> > Guests may map a very large area of memory to IOMMU (or when vIOMMU is
> > disabled, GPA is used). It would be several TBs, how can the device reserve
> > sufficient resources in this case?
> When the mapping is established, the ranges are supplied to the device so it knows how much to reserve.
> If the device does not have enough resources, it fails the command.
>
> One can take it further and provision for the desired range..

Well, I think I've asked whether or not a bisection is needed, and you
told me not ...

But at least we need to document this in the proposal, no?

> >
> > >
> > > > Btw, the IOVA is allocated by the guest actually, how can we know the
> > range?
> > > > (or using the host range?)
> > > >
> > > Hypervisor would have mapping translation.
> >
> > That's really tricky and can only work in some cases:
> >
> > 1) It requires the hypervisor to traverse the guest I/O page tables which could
> > be very large range
> > 2) It requests the hypervisor to trap the modification of guest I/O page tables
> > and synchronize with the range changes, which is inefficient and can only be
> > done when we are doing shadow PTEs. It won't work when the nesting
> > translation could be offloaded to the hardware
> > 3) It is racy with the guest modification of I/O page tables which is explained in
> > another thread
> Mapping changes are not a frequent event with hardware MMUs, and at IOTLB flush time the dirty log is queried for just the smaller affected range.
>
> > 4) No aware of new features like PASID which has been explained in another
> > thread
> For all the pinned memory with a non-software-based IOMMU, it is typically a small subset.
> PASID is guest controlled.

Let's repeat my points:

1) vq1 use untranslated request with PASID1
2) vq2 use untranslated request with PASID2

Shouldn't we log PASID as well?

And

1) vq1 is using translated request
2) vq2 is using untranslated request

How could we differentiate them?
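
To illustrate the point: a record that carries only an address is
ambiguous once PASIDs or translated requests are involved. Something
like the following extra metadata would be needed per record for the
hypervisor to interpret it; this struct is purely illustrative and is
not part of the v3 proposal.

    /* Illustrative only; not part of the proposal. */
    #include <stdint.h>

    struct write_rec_with_context {
        uint64_t addr;       /* PA or IOVA -- meaning depends on the flags  */
        uint64_t len;
        uint32_t pasid;      /* which address space the IOVA belongs to     */
        uint8_t  has_pasid;  /* request carried a PASID TLP prefix          */
        uint8_t  translated; /* ATS-translated (PA) vs. untranslated (IOVA) */
    };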

>
> >
> > >
> > > > >
> > > > > > Host should always have more resources than device, in that
> > > > > > sense there could be several methods that tries to utilize host
> > > > > > memory instead of the one in the device. I think we've discussed
> > > > > > this when going through the doc prepared by Eugenio.
> > > > > >
> > > > > > >
> > > > > > > > What happens if we're trying to migrate more than 1 device?
> > > > > > > >
> > > > > > > That is perfectly fine.
> > > > > > > Each device is updating its log of pages it wrote.
> > > > > > > The hypervisor is collecting their sum.
> > > > > >
> > > > > > See above.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > 3) DMA is part of the transport, it's natural to do
> > > > > > > > > > logging there, why duplicate efforts in the virtio layer?
> > > > > > > > > He he, you have funny comment.
> > > > > > > > > When an abstract facility is added to virtio you say to do in
> > transport.
> > > > > > > >
> > > > > > > > So it's not done in the general facility but tied to the admin part.
> > > > > > > > And we all know dirty page tracking is a challenge and
> > > > > > > > Eugenio has a good summary of pros/cons. A revisit of those
> > > > > > > > docs make me think virtio is not the good place for doing that for
> > may reasons:
> > > > > > > >
> > > > > > > > 1) as stated, platform will evolve to be able to tracking
> > > > > > > > dirty pages, actually, it has been supported by a lot of
> > > > > > > > major IOMMU vendors
> > > > > > >
> > > > > > > This is optional facility in virtio.
> > > > > > > Can you please point to the references? I don’t see it in the
> > > > > > > common Linux
> > > > > > kernel support for it.
> > > > > >
> > > > > > Note that when IOMMUFD is being proposed, dirty page tracking is
> > > > > > one of the major considerations.
> > > > > >
> > > > > > This is one recent proposal:
> > > > > >
> > > > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > > > >
> > > > > Sure, so if platform supports it. it can be used from the platform.
> > > > > If it does not, the device supplies it.
> > > > >
> > > > > > > Instead Linux kernel choose to extend to the devices.
> > > > > >
> > > > > > Well, as I stated, tracking dirty pages is challenging if you
> > > > > > want to do it on a device, and you can't simply invent dirty
> > > > > > page tracking for each type of the devices.
> > > > > >
> > > > > It is not invented.
> > > > > It is generic framework for all virtio device types as proposed here.
> > > > > Keep in mind, that it is optional already in v3 series.
> > > > >
> > > > > > > At least not seen to arrive this in any near term in start of
> > > > > > > 2024 which is
> > > > > > where users must use this.
> > > > > > >
> > > > > > > > 2) you can't assume virtio is the only device that can be
> > > > > > > > used by the guest, having dirty pages tracking to be
> > > > > > > > implemented in each type of device is unrealistic
> > > > > > > Of course, there is no such assumption made. Where did you see
> > > > > > > a text that
> > > > > > made such assumption?
> > > > > >
> > > > > > So what happens if you have a guest with virtio and other devices
> > assigned?
> > > > > >
> > > > > What happens? Each device type would do its own dirty page tracking.
> > > > > And if all devices does not have support, hypervisor knows to fall
> > > > > back to
> > > > platform iommu or its own.
> > > > >
> > > > > > > Each virtio and non virtio devices who wants to report their
> > > > > > > dirty page report,
> > > > > > will do their way.
> > > > > > >
> > > > > > > > 3) inventing it in the virtio layer will be deprecated in
> > > > > > > > the future for sure, as platform will provide much rich
> > > > > > > > features for logging e.g it can do it per PASID etc, I don't
> > > > > > > > see any reason virtio need to compete with the features that
> > > > > > > > will be provided by the platform
> > > > > > > Can you bring the cpu vendors and committement to virtio tc
> > > > > > > with timelines
> > > > > > so that virtio TC can omit?
> > > > > >
> > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio
> > > > > > needs to be built on top of transport or platform. There's no
> > > > > > need to duplicate
> > > > their job.
> > > > > > Especially considering that virtio can't do better than them.
> > > > > >
> > > > > I wanted to see a strong commitment for the cpu vendors to support
> > > > > dirty
> > > > page tracking.
> > > >
> > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > > ARM are all supporting that now.
> > > >
> > > > > And the work seems to have started for some platforms.
> > > >
> > > > Let me quote from the above link:
> > > >
> > > > """
> > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > alongside VT-D rev3.x also do support.
> > > > """
> > > >
> > > > > Without such platform commitment, virtio also skipping it would not work.
> > > >
> > > > Is the above sufficient? I'm a little bit more familiar with vtd,
> > > > the hw feature has been there for years.
> > > >
> > > Vtd has a sticky D bit that requires synchronization with IOPTE page caches
> > when sw wants to clear it.
> >
> > This is by design.
> >
> > > Do you know if is it reliable when device does multiple writes, ie,
> > >
> > > a. iommu write D bit
> > > b. software read it
> > > c. sw synchronize cache
> > > d. iommu write D bit on next write by device
> >
> > What issue did you see here? But that's not even an excuse, if there's a bug,
> > let's report it to IOMMU vendors and let them fix it. The thread I point to you is
> > actually a good space.
> >
> So we cannot claim that it is there in the platform.

I'm confused; the thread I pointed you to does the cache
synchronization, which is explained in the changelog, so what's the issue?

>
> > Again, the point is to let the correct role play.
> >
> How many more years should we block virtio device migration when the platform does not have it?

At least for VT-D, it has been used for years.

>
> > >
> > > ARM SMMU based servers to be present with D bit tracking.
> > > It is still early to say platform is ready.
> >
> > This is not what I read from both the series I posted and the spec, dirty bit has
> > been supported several years ago at least for vtd.
> Supported, but the spec lists it as a sticky bit that may require special handling.

Please explain why this is "special handling". The IOMMU has several
different layers of caching; by design, it can't just open a window
for the D bit.

> Maybe it is working, but not all CPU platforms have it.

I don't see the point. Migration is not supported for virtio either.

>
> >
> > >
> > > It is optional so whichever has the support it will be used.
> >
> > I can't see the point of this, it is already available. And migration doesn't exist in
> > virtio spec yet.
> >
> > >
> > > > >
> > > > > > > i.e. in first year of 2024?
> > > > > >
> > > > > > Why does it matter in 2024?
> > > > > Because users needs to use it now.
> > > > >
> > > > > >
> > > > > > > If not, we are better off to offer this, and when/if platform
> > > > > > > support is, sure,
> > > > > > this feature can be disabled/not used/not enabled.
> > > > > > >
> > > > > > > > 4) if the platform support is missing, we can use software
> > > > > > > > or leverage transport for assistance like PRI
> > > > > > > All of these are in theory.
> > > > > > > Our experiment shows PRI performance is 21x slower than page
> > > > > > > fault rate
> > > > > > done by the cpu.
> > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > >
> > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > Do you have perf data for this?
> > > >
> > > > No, but it's not hard to imagine the worst case. Wrote a small
> > > > program that dirty every page by a NIC.
> > > >
> > > > > In the internal tests we don’t see this happening.
> > > >
> > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > >
> > > > So if we get very high dirty rates (e.g by a high speed NIC), we
> > > > can't satisfy the requirement of the downtime. Or if you see the
> > > > converge, you might get help from the auto converge support by the
> > > > hypervisors like KVM where it tries to throttle the VCPU then you can't reach
> > the wire speed.
> > > >
> Once PRI is enabled, even without migration, there are basic performance issues.
> >
> > The context is not PRI here...
> >
> > It's about if you can stick to wire speed during live migration. Based on the
> > analysis so far, you can't achieve wirespeed and downtime at the same time.
> > That's why the hypervisor needs to throttle VCPU or devices.
> >
> So?
> The device may also throttle itself.

That's perfectly fine. We are on the same page, no? It's wrong to
judge the dirty page tracking in the context of live migration by
measuring whether or not the device can work at wire speed.
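
To put rough, illustrative numbers on the equation quoted earlier
(downtime = dirty_rate * PAGE_SIZE / migration_speed): a 100 Gbps NIC
receiving at line rate can dirty up to about

    100 Gbps / 8             = 12.5 GB/s of guest memory, i.e.
    12.5 GB/s / 4 KiB pages ~=  3 million dirty pages per second.

With a 25 Gbps (~3 GB/s) migration link, memory is dirtied roughly four
times faster than it can be copied out, so pre-copy cannot converge
without throttling the workload (vCPU or device) or accepting a long
stop-and-copy phase.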

>
> > For PRI, it really depends on how you want to use it. E.g if you don't want to pin
> > a page, the performance is the price you must pay.
> PRI without pinning does not make sense for a device making large mapping queries.

That's also fine. Hypervisors can choose to enable and use PRI
depending on the different cases.

>
> >
> > >
> > > > >
> > > > > >
> > > > > > > There is no requirement for mandating PRI either.
> > > > > > > So it is unusable.
> > > > > >
> > > > > > It's not about mandating, it's about doing things in the correct
> > > > > > layer. If PRI is slow, PCI can evolve for sure.
> > > > > You should try.
> > > >
> > > > Not my duty, I just want to make sure things are done in the correct
> > > > layer, and once it needs to be done in the virtio, there's nothing obviously
> > wrong.
> > > >
> At present, it looks like not all platforms are equally ready for page tracking.
> >
> > That's not an excuse to let virtio support that.
> It is a wrong attribution to call it an excuse.
>
> > And we need also to figure out if
> > virtio can do that easily. I've pointed out sufficient issues, I'm pretty sure there
> > would be more as the platform evolves.
> >
> I am not sure if virtio feeds the log into the platform.

I don't understand the meaning here.

>
> > >
> > > > > In the current state, it is mandating.
> > > > > And if you think PRI is the only way,
> > > >
> > > > I don't, it's just an example where virtio can leverage from either
> > > > transport or platform. Or if it's the fault in virtio that slows
> > > > down the PRI, then it is something we can do.
> > > >
> > > Yea, it does not seem to be ready yet.
> > >
> > > > >  than you should propose that in the dirty page tracking series
> > > > > that you listed
> > > > above to not do dirty page tracking. Rather depend on PRI, right?
> > > >
> > > > No, the point is to not duplicate works especially considering
> > > > virtio can't do better than platform or transport.
> > > >
> > > Both the platform and virtio work is ongoing.
> >
> > Why duplicate the work then?
> >
> Not all CPU platforms support it, as far as I know.

Yes, but we all know the platform is working to support this.

Supporting this on the device is hard.

>
> > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > When one does something in transport, you say, this is
> > > > > > > > > transport specific, do
> > > > > > > > some generic.
> > > > > > > > >
> > > > > > > > > Here the device is being tracked is virtio device.
> > > > > > > > > PCI-SIG has told already that PCIM interface is outside the scope of
> > it.
> > > > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > > > >
> > > > > > > > You will end up with a competition with the
> > > > > > > > platform/transport one that will fail.
> > > > > > > >
> > > > > > > I don’t see a reason. There is no competition.
> > > > > > > Platform always have a choice to not use device side page
> > > > > > > tracking when it is
> > > > > > supported.
> > > > > >
> > > > > > Platform provides a lot of other functionalities for dirty logging:
> > > > > > e.g per PASID, granular, etc. So you want to duplicate them
> > > > > > again in the virtio? If not, why choose this way?
> > > > > >
> > > > > It is optional for the platforms where platform do not have it.
> > > >
> > > > We are developing new virtio functionalities that are targeted for
> > > > future platforms. Otherwise we would end up with a feature with a
> > > > very narrow use case.
> > > In general I agree that platform is an option too.
> > > Hypervisor will be able to make the decision to use platform when available
> > and fallback to device method when platform does not have it.
> > >
> > > Future and to be equally usable in near term :)
> >
> > Please don't double standard again:
> >
> > When you are talking about TDISP, you want virtio to be designed to fit for the
> > future where the platform is ready in the future When you are talking about
> > dirty tracking, you want it to work now even if
> >
> The proposal of transport VQ is anti-TDISP.

It's nothing about the transport VQ; it's about what you're saying
regarding the adminq-based device context. There was a comment pointing
out that the current TDISP spec forbids modifying device state when a
TVM is attached. Then you told us that TDISP may evolve for that.

> The proposal of dirty tracking is not anti-platform. It is optional, like the rest of the platform features.
>
> > 1) most of the platform is ready now
> Can you list an ARM server CPU in production that has it? (not just in some PDF spec)

Then, in the context of dirty pages, I've shown you that dirty page
tracking has been supported by all major vendors. Yet you refuse to
apply the same standard you used when explaining the adminq for device
context in TDISP.

So I didn't ask you for the ETA of TDISP support for migration or the
adminq, but you want me to give you production information, which is
pointless. You might need to ask ARM to get an answer, but a simple
Google search tells me the effort to support dirty page tracking in
SMMUv3 goes back to early 2021.

https://lore.kernel.org/linux-iommu/56b001fa-b4fe-c595-dc5e-f362d2f07a19@linux.intel.com/t/

Why is it not merged? It's simply because we agree to do it in the
layer of IOMMUFD so it needs to wait.

Thanks


>
> > 2) whether or not virtio can log dirty page correctly is still suspicious
> >
> > Thanks
>
> There is no double standard. The feature is optional and co-exists, as explained above.




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15 17:37                                   ` [virtio-comment] " Parav Pandit
@ 2023-11-16  4:24                                     ` Jason Wang
  2023-11-16  6:49                                       ` Michael S. Tsirkin
  2023-11-16  6:50                                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-16  4:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 1:37 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:11 AM
> >
> > On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > Hi Michael,
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 9, 2023 1:29 PM
> > >
> > > [..]
> > > > > Besides the issue of performance, it's also racy, assuming we are
> > > > > logging
> > > > IOVA.
> > > > >
> > > > > 0) device log IOVA
> > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > 2) guest map IOVA to a new GPA
> > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > >
> > > > > Then we lost the old GPA.
> > > >
> > > > Interesting and a good point. And by the way e.g. vhost has the same
> > > > issue.  You need to flush dirty tracking info when changing the
> > > > mappings somehow.  Parav what's the plan for this? Should be addressed in
> > the spec too.
> > > >
> > > As you listed the flush is needed for vhost or device-based DPT.
> >
> > What does DPT mean? Device Page Table? Let's not invent terminology which is
> > not known by others please.
> >
> Sorry for using the acronym. I meant dirty page tracking.
>
> > We have discussed it many times. You can't just depend on ATS or reinventing
> > wheels in virtio.
> The dependency is on the iommu which would have the mapping of GIOVA to GPA like any sw implementation.
> No dependency on ATS.
>
> >
> > What's more, please try not to give me the impression that the proposal is
> > optimized for a specific vendor (like device IOMMU stuffs).
> >
> You should stop calling this specific vendor thing.

Well, as you have explained, the confusion came from "DPT" ...

> One can equally say that suspend bit proposal is for the sw_vendor device who is forcing virtio hw device to only implement ioqueues + PASID + non_unified interface for PF, VF, SIOVs + non_TDISP based devices.
>
> > > The necessary plumbing is already covered for this in the query (read and
> > clear) command of this v3 proposal.
> >
> > The issue is logging via IOVA ... I don't see how "read and clear" can help.
> >
> Read and clear helps that ensures that all the dirty pages are reported, hence there is no mapping/unmapping race.

Reported as IOVA ...

> As everything is reported.
>
> > > It is listed in Device Write Records Read Command.
> >
> > Please explain how your proposal can solve the above race.
> >
> In below manner.
> 1. guest has GIOVA to GPA_1 mapping
> 2. RX packets occurred to GIOVA
> 3. device reported dirty page log for GIOVA (hypervisor is yet to read)
> 4. guest requested mapping change from GIOVA to GPA_2
> 4.1 During this IOTLB is invalidated and dirty page report is queried ensuring, it can change the mapping

It requires

1) the hypervisor to trap IOTLB invalidations, which doesn't work when
nesting is offloaded (IOMMUFD has started the work to support nesting)
2) querying the device about dirty pages on each IOTLB invalidation,
which means:
2.1) a huge round trip, as sketched below: guest IOTLB invalidation ->
trapped by the hypervisor -> query started on the device -> device
returns -> hypervisor reports the IOTLB invalidation as done -> guest
runs again. Have you benchmarked the RTT in this case? There are just
too many places in the middle that add delay.
2.2) guest-triggerable behaviour: a malicious guest can simply issue
endless IOTLB invalidations to DoS e.g. the admin virtqueue
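A rough sketch of that trap-and-query path, purely for illustration;
every name below is hypothetical, this is not an existing hypervisor or
IOMMUFD interface:

/* Called when the guest's IOTLB invalidation is trapped. */
static int handle_guest_iotlb_invalidate(struct viommu *viommu,
                                         const struct inv_request *req)
{
        int ret;

        /* Read (and clear) the device's write records for the range
         * being invalidated, while the old IOVA->GPA mapping is still
         * known to the hypervisor. */
        ret = virtio_admin_query_write_records(viommu->owner_dev,
                                               req->iova, req->len);
        if (ret)
                return ret;

        /* Only now complete the invalidation and resume the guest;
         * every guest-issued invalidation pays for this device round
         * trip. */
        return viommu_complete_invalidation(viommu, req);
}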

>
> > >
> > > When the page write record is fully read, it is flushed.
> > > How/when to use, I think its hypervisor specific, so we probably better off not
> > documenting those details.
> >
> > Well, as the author of this proposal, at least you need to know how a hypervisor
> > can work with your proposal, no?
> >
> Likely yes, but it is not the scope of the spec to list those paths etc.

Fine, but as a reviewer I need to know whether it can work well with a
hypervisor.

>
> > > May be such read is needed in some other path too depending on how
> > hypervisor implemented.
> >
> > What do you mean by "May be ... some other path" here? You're inventing a
> > mechanism that you don't know how a hypervisor can use?
>
> No. I meant hypervisor may have more operations that map/unmap/flush where it may need to implement it.
> Some one may call it set_map(), some may say dma_map()...

Ok.

Thanks



* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  4:18                                   ` [virtio-comment] " Jason Wang
@ 2023-11-16  5:27                                     ` Parav Pandit
  0 siblings, 0 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-16  5:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 16, 2023 9:49 AM
> 
> On Thu, Nov 16, 2023 at 1:42 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:02 AM
> > >
> > > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > > >
> > > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin
> > > > > <mst@redhat.com>
> > > wrote:
> > > > > >
> > > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin
> > > > > > > <mst@redhat.com>
> > > wrote:
> > > > > > > >
> > > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > > Each virtio and non virtio devices who wants to
> > > > > > > > > > > > report their dirty page report,
> > > > > > > > > > > will do their way.
> > > > > > > > > > > >
> > > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > > deprecated in the future for sure, as platform
> > > > > > > > > > > > > will provide much rich features for logging e.g
> > > > > > > > > > > > > it can do it per PASID etc, I don't see any
> > > > > > > > > > > > > reason virtio need to compete with the features
> > > > > > > > > > > > > that will be provided by the platform
> > > > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > > > virtio tc with timelines
> > > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > > >
> > > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > > Virtio needs to be built on top of transport or
> > > > > > > > > > > platform. There's
> > > no need to duplicate their job.
> > > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > > >
> > > > > > > > > > I wanted to see a strong commitment for the cpu
> > > > > > > > > > vendors to
> > > support dirty page tracking.
> > > > > > > > >
> > > > > > > > > The RFC of IOMMUFD support can go back to early 2022.
> > > > > > > > > Intel, AMD and ARM are all supporting that now.
> > > > > > > > >
> > > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > > >
> > > > > > > > > Let me quote from the above link:
> > > > > > > > >
> > > > > > > > > """
> > > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > > """
> > > > > > > > >
> > > > > > > > > > Without such platform commitment, virtio also skipping
> > > > > > > > > > it would
> > > not work.
> > > > > > > > >
> > > > > > > > > Is the above sufficient? I'm a little bit more familiar
> > > > > > > > > with vtd, the hw feature has been there for years.
> > > > > > > >
> > > > > > > >
> > > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > > >
> > > > > > > I think this comment applies to this proposal as well.
> > > > > >
> > > > > > Yes - some systems might be better off with platform tracking.
> > > > > > And I think supporting shadow vq better would be nice too.
> > > > >
> > > > > For shadow vq, did you mean the work that is done by Eugenio?
> > > >
> > > > Yes.
> > >
> > > That's exactly why vDPA starts with shadow virtqueue. We've
> > > evaluated various possible approaches, each of them have their
> > > shortcomings and shadow virtqueue is the only one that doesn't
> > > require any additional hardware features to work in every platform.
> > >
> > > >
> > > > > >
> > > > > > > > Definitely KVM did
> > > > > > > > not scan PTEs. It used pagefaults with bit per page and
> > > > > > > > later as VM size grew switched to PLM.  This interface is
> > > > > > > > analogous to PLM,
> > > > > > >
> > > > > > > I think you meant PML actually. And it doesn't work like
> > > > > > > PML. To behave like PML it needs to
> > > > > > >
> > > > > > > 1) log buffers were organized as a queue with indices
> > > > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs
> > > > > > > out of the buffers
> > > > > > > 3) device need to send a notification to the driver if it
> > > > > > > runs out of the buffer
> > > > > > >
> > > > > > > I don't see any of the above in this proposal. If we do that
> > > > > > > it would be less problematic than what is being proposed here.
> > > > > >
> > > > > > What is common between this and PML is that you get the
> > > > > > addresses directly without scanning megabytes of bitmaps or
> > > > > > worse - hundreds of megabytes of page tables.
> > > > >
> > > > > Yes, it has overhead but this is the method we use for vhost and
> > > > > KVM
> > > (earlier).
> > > > >
> > > > > To me the  important advantage of PML is that it uses limited
> > > > > resources on the host which
> > > > >
> > > > > 1) doesn't require resources in the device
> > > > > 2) doesn't scale as the guest memory increases. (but this
> > > > > advantage doesn't exist in neither this nor bitmap)
> > > >
> > > > it seems 2 exactly exists here.
> > >
> > > Actually not, Parav said the device needs to reserve sufficient
> > > resources in another thread.
> > The device resource reservation starts only when the device migration starts.
> > i.e. with WRITE_RECORDS_START command of patch 7 in the series.
> 
> Right, but this is not the question, see below.
> 
> >
> > >
> > > >
> > > >
> > > > > >
> > > > > > The data structure is different but I don't see why it is critical.
> > > > > >
> > > > > > I agree that I don't see out of buffers notifications too
> > > > > > which implies device has to maintain something like a bitmap internally.
> > > > > > Which I guess could be fine but it is not clear to me how
> > > > > > large that bitmap has to be. How does the device know? Needs to be
> addressed.
> > > > >
> > > > > This is the question I asked Parav in another thread. Using host
> > > > > memory as a queue with notification (like PML) might be much better.
> > > >
> > > > Well if queue is what you want to do you can just do it internally.
> > >
> > > Then it's not the proposal here, Parav has explained it in another
> > > reply, and as explained it lacks a lot of other facilities.
> > >
> > PML is yet another option that requires small pci writes.
> > In the current proposal, there are no small PCI writes.
> > It is a query interface from the device.
> 
> Well, you've explained in another thread that actually it needs small PCI writes.
>
No. There may be some misunderstanding.
 
> E.g during IOTLB invalidation ...
> 
This is not part of the virtio interface.

> >
> > > > Problem of course is that it might overflow and cause things like
> > > > packet drops.
> > >
> > > Exactly like PML. So sticking to wire speed should not be a general
> > > goal in the context of migration. It can be done if the speed of the
> > > migration interface is faster than the virtio device that needs to be migrated.
> > May not have to be.
> > Speed of page recording should be fast enough.
> > It usually improves with subsequent generation.
> 
> If you have something better, let's propose it from the start.
>
It is unproven that the current proposal cannot be done by the device.
So I prefer to do this incrementally when we get there.
 
> > >
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > > >
> > > > > > > 1) For many reasons it can neither see nor log via GPA, so
> > > > > > > this requires a traversal of the vIOMMU mapping tables by
> > > > > > > the hypervisor afterwards, it would be expensive and need
> > > > > > > synchronization with the guest modification of the IO page
> > > > > > > table which
> > > looks very hard.
> > > > > >
> > > > > > vIOMMU is fast enough to be used on data path but not fast
> > > > > > enough for dirty tracking?
> > > > >
> > > > > We set up SPTEs or using nesting offloading where the PTEs could
> > > > > be iterated by hardware directly which is fast.
> > > >
> > > > There's a way to have hardware find dirty PTEs for you quickly?
> > >
> > > Scanning PTEs on the host is faster and more secure than scanning
> > > guests, that's what I want to say:
> > >
> > > 1) the guest page could be swapped out but not the host one.
> > > 2) no guest triggerable behavior
> > >
> >
> > Device page tracking table to be consulted to flush on mapping change.
> >
> > > > I don't know how it's done. Do tell.
> > > >
> > > >
> > > > > This is not the case here where software needs to iterate the IO
> > > > > page tables in the guest which could be slow.
> > > > >
> > > > > > Hard to believe.  If true and you want to speed up vIOMMU then
> > > > > > you implement an efficient datastructure for that.
> > > > >
> > > > > Besides the issue of performance, it's also racy, assuming we
> > > > > are logging
> > > IOVA.
> > > > >
> > > > > 0) device log IOVA
> > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > 2) guest map IOVA to a new GPA
> > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > >
> > > > > Then we lost the old GPA.
> > > >
> > > > Interesting and a good point.
> > >
> > > Note that PML logs at GPA as it works at L1 of EPT.
> > >
> > > > And by the way e.g. vhost has the same issue.  You need to flush
> > > > dirty tracking info when changing the mappings somehow.
> > >
> > > It's not,
> > >
> > > 1) memory translation is done by vhost
> > > 2) vhost knows GPA and it doesn't log via IOVA.
> > >
> > > See this for example, and DPDK has similar fixes.
> > >
> > > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > > Author: Jason Wang <jasowang@redhat.com>
> > > Date:   Wed Jan 16 16:54:42 2019 +0800
> > >
> > >     vhost: log dirty page correctly
> > >
> > >     Vhost dirty page logging API is designed to sync through GPA. But we
> > >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > >     lead to missing data after migration.
> > >
> > >     To solve this issue, when logging with device IOTLB enabled, we will:
> > >
> > >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > >        get HVA, for writable descriptor, get HVA through iovec. For used
> > >        ring update, translate its GIOVA to HVA
> > >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> > >        through GPA. Pay attention this reverse mapping is not guaranteed
> > >        to be unique, so we should log each possible GPA in this case.
> > >
> > >     This fix the failure of scp to guest during migration. In -next, we
> > >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > >
> > >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> > >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> > >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > All of the above is not what virtio did right now.
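To spell out what that commit does, in illustrative pseudocode only
(these are not the actual vhost symbols): log through GPA even though
the device only knows GIOVA, by going GIOVA -> HVA -> every matching
GPA.

static void log_write_giova(struct vhost_dev *dev, u64 giova, u64 len)
{
        u64 hva = iotlb_translate_giova_to_hva(dev, giova, len);
        u64 gpa, gpa_len;

        /* The HVA -> GPA reverse mapping is not guaranteed to be
         * unique, so every possible GPA has to be logged. */
        for_each_gpa_of_hva(dev, hva, len, gpa, gpa_len)
                log_dirty_gpa_range(dev, gpa, gpa_len);
}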
> > >
> > > > Parav what's the plan for this? Should be addressed in the spec too.
> > > >
> > >
> > > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> > >
> >
> > The query interface in this proposal works on the granular boundary to read
> and clear.
> > This will ensure that mapping is consistent.
> >
> > > >
> > > >
> > > > > >
> > > > > > > 2) There are a lot of special or reserved IOVA ranges (for
> > > > > > > example the interrupt areas in x86) that need special care
> > > > > > > which is architectural and where it is beyond the scope or
> > > > > > > knowledge of the virtio device but the platform IOMMU.
> > > > > > > Things would be more complicated when SVA is enabled.
> > > > > >
> > > > > > SVA being what here?
> > > > >
> > > > > For example, IOMMU may treat interrupt ranges differently
> > > > > depending on whether SVA is enabled or not. It's very hard and
> > > > > unnecessary to teach devices about this.
> > > >
> > > > Oh, shared virtual memory. So what you are saying here? virtio
> > > > does not care, it just uses some addresses and if you want it to
> > > > it can record writes somewhere.
> > >
> > > One example, PCI allows devices to send translated requests, how can
> > > a hypervisor know it's a PA or IOVA in this case? We probably need a
> > > new bit. But it's not the only thing we need to deal with.
> > >
> > > By definition, interrupt ranges and other reserved ranges should not
> > > belong to dirty pages. And the logging should be done before the DMA
> > > where there's no way for the device to know whether or not an IOVA
> > > is valid or not. It would be more safe to just not report them from
> > > the source instead of leaving it to the hypervisor to deal with but
> > > this seems impossible at the device level. Otherwise the hypervisor
> > > driver needs to communicate with the (v)IOMMU to be reached with the
> > > interrupt(MSI) area, RMRR area etc in order to do the correct things
> > > or it might have security implications. And those areas don't make
> > > sense at L1 when vSVA is enabled. What's more, when vIOMMU could be
> > > fully offloaded, there's no easy way to fetch that information.
> > >
> > There cannot be logging before the DMA.
> 
> Well, I don't see how this is related to the issue above. Logging after the DMA
> doesn't mean the device can know what sits behind an IOVA, no?
> 
The device does not know what sits behind an IOVA.
The IOVA to GPA mapping is maintained by the IOMMU.
Hence, the device records the IOVAs and the hypervisor queries them.

> > Only requirement is before the mapping changes, the dirty page tracking to be
> synced.
> >
> > In most common cases where the perf is critical, such mapping wont change
> so often dynamically anyway.
> 
> I've explained the issue in another reply.
> 
> >
> > > Again, it's hard to bypass or even duplicate the functionality of
> > > the platform or we need to step into every single detail of a
> > > specific transport, architecture or IOMMU to figure out whether or
> > > not logging at virtio is correct which is awkward and unrealistic.
> > > This proposal suffers from an exact similar issue when inventing
> > > things like freeze/stop where I've pointed out other branches of issues as
> well.
> > >
> > It is incorrect attribution that platform is duplicated here.
> > It feeds the data to the platform as needed without replicating.
> >
> > I do agree that there is overlap of IOMMU tracking the dirty and storing it in
> the per PTE vs device supplying its dirty track via its own interface.
> > Both are consolidated at hypervisor level.
> >
> > > >
> > > > > >
> > > > > > > And there could be other architecte specific knowledge (e.g
> > > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal
> > > > > > > with those cases.
> > > > > >
> > > > > > Good point about page size actually - using 4k unconditionally
> > > > > > is a waste of resources.
> > > > >
> > > > > Actually, they are more than just PAGE_SIZE, for example, PASID and
> others.
> > > >
> > > > what does pasid have to do with it? anyway, just give driver
> > > > control over page size.
> > >
> > > For example, two virtqueues have two PASIDs assigned. How can a
> > > hypervisor know which specific IOVA belongs to which IOVA? For
> > > platform IOMMU, they are handy as it talks to the transport. But I
> > > don't think we need to duplicate every transport specific address space
> feature in core virtio layer:
> > >
> > PASID to vq assignment won't be duplicated.
> > It is configured fully by the guest without consulting hypervisor at the device
> level.
> > Guest IOMMU would consult hypervisor to setup any PASID mapping as part
> of any mapping method.
> >
> > > 1) translated/untranslated request
> > > 2) request w/ and w/o PASID
> > >
> > > >
> > > > > >
> > > > > >
> > > > > > > We wouldn't need to care about all of them if it is done at
> > > > > > > platform IOMMU level.
> > > > > >
> > > > > > If someone logs at IOMMU level then nothing needs to be done
> > > > > > in the spec at all. This is about capability at the device level.
> > > > >
> > > > > True, but my question is where or not it can be done at the
> > > > > device level
> > > easily.
> > > >
> > > > there's no "easily" about live migration ever.
> > >
> > > I think I've stated sufficient issues to demonstrate how hard virtio wants to
> do it.
> > > And I've given the link that it is possible to do that in IOMMU
> > > without those issues. So in this context doing it in virtio is much harder.
> > >
> > > > For example on-device iommus are a thing.
> > >
> > > I'm not sure that's the way to go considering the platform IOMMU
> > > evolves very quickly.
> > >
> > > >
> > > > > >
> > > > > >
> > > > > > > > what Lingshan
> > > > > > > > proposed is analogous to bit per page - problem
> > > > > > > > unfortunately is you can't easily set a bit by DMA.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not saying bit/bytemap is the best, but it has been used
> > > > > > > by real hardware. And we have many other options.
> > > > > > >
> > > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > >
> > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > Because users needs to use it now.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > If not, we are better off to offer this, and
> > > > > > > > > > > > when/if platform support is, sure,
> > > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > > >
> > > > > > > > > > > > > 4) if the platform support is missing, we can
> > > > > > > > > > > > > use software or leverage transport for
> > > > > > > > > > > > > assistance like PRI
> > > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > > than page fault rate
> > > > > > > > > > > done by the cpu.
> > > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > >
> > > > > > > > > > > If you stick to the wire speed during migration, it can
> converge.
> > > > > > > > > > Do you have perf data for this?
> > > > > > > > >
> > > > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > > > small program that dirty every page by a NIC.
> > > > > > > > >
> > > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > > >
> > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > >
> > > > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > > > NIC), we can't satisfy the requirement of the downtime.
> > > > > > > > > Or if you see the converge, you might get help from the
> > > > > > > > > auto converge support by the hypervisors like KVM where
> > > > > > > > > it tries to throttle the VCPU then you can't reach the wire speed.
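To plug illustrative numbers (not measurements) into that formula,
assuming 4KiB pages:

  dirty_rate       = 2,000,000 pages/s  (a NIC receiving into fresh pages)
  dirtied memory   = 2,000,000 * 4KiB  ~= 8 GByte/s
  migration_speed  = 3 GByte/s          (~25Gbps migration link)

With such numbers the guest memory is dirtied faster than it can be
copied, so the pre-copy iterations never shrink the remaining dirty set
and the downtime target cannot be met unless the vCPUs or the device
are throttled.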
> > > > > > > >
> > > > > > > > Will only work for some device types.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, that's the point. Parav said he doesn't see the issue,
> > > > > > > it's probably because he is testing a virtio-net and so the
> > > > > > > vCPU is automatically throttled. It doesn't mean it can work
> > > > > > > for other virito devices.
> > > > > >
> > > > > > Only for TX, and I'm pretty sure they had the foresight to
> > > > > > test RX not just TX but let's confirm. Parav did you test both directions?
> > > > >
> > > > > RX speed somehow depends on the speed of refill, so throttling
> > > > > helps more or less.
> > > >
> > > > It doesn't depend on speed of refill you just underrun and drop
> > > > packets. then your nice 10usec latency becomes more like 10sec.
> > >
> > > I miss your point here. If the driver can't achieve wire speed
> > > without dirty page tracking, it can neither when dirty page tracking is
> enabled.
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > So it is unusable.
> > > > > > > > > > >
> > > > > > > > > > > It's not about mandating, it's about doing things in
> > > > > > > > > > > the correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > > You should try.
> > > > > > > > >
> > > > > > > > > Not my duty, I just want to make sure things are done in
> > > > > > > > > the correct layer, and once it needs to be done in the
> > > > > > > > > virtio, there's nothing obviously wrong.
> > > > > > > >
> > > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > > >
> > > > > > > I don't think it's vague, I have explained, if something in
> > > > > > > the virito slows down the PRI, we can try to fix them.
> > > > > >
> > > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > > >
> > > > > So it's the fault of PRI not virito, but it doesn't mean we need
> > > > > to do it in virtio.
> > > >
> > > > I keep saying with this approach we would just say "e1000
> > > > emulation is slow and encumbered this is the fault of e1000" and
> > > > never get virtio at all.  Assigning blame only gets you so far.
> > >
> > > I think we are discussing different things. My point is virtio needs
> > > to leverage the functionality provided by transport or platform
> > > (especially considering they evolve faster than virtio). It seems to
> > > me it's hard even to duplicate some basic function of platform IOMMU in
> virtio.
> > >
> > Not duplicated. Feeding into the platform.
> >
> > > >
> > > > > >
> > > > > > > Missing functions in
> > > > > > > platform or transport is not a good excuse to try to
> > > > > > > workaround it in the virtio. It's a layer violation and we
> > > > > > > never had any feature like this in the past.
> > > > > >
> > > > > > Yes missing functionality in the platform is exactly why
> > > > > > virtio was born in the first place.
> > > > >
> > > > > Well the platform can't do device specific logic. But that's not
> > > > > the case of dirty page tracking which is device logic agnostic.
> > > >
> > > > Not true platforms have things like NICs on board and have for
> > > > many years. It's about performance really.
> > >
> > > I've stated sufficient issues above. And one more obvious issue for
> > > device initiated page logging is that it needs a lot of extra or
> > > unnecessary PCI transactions which will throttle the performance of
> > > the whole system (and lead to other issues like QOS). So I can't believe it has
> good performance overall.
> > > Logging via IOMMU or using shadow virtqueue doesn't need any extra
> > > PCI transactions at least.
> > >
> > In the current proposal, it does not required PCI transactions, as there is only a
> hypervisor-initiated query interface.
> 
> Such query requires at least several transactions, no?
>
It depends on how much memory is unmapped.

For the pinned-memory cases with a hardware IOMMU, mapping and
unmapping are not frequent events.
 
> Or to make things more clear, could you list the steps how a hypervisor is
> expected to do the querying?
>
It is listed in the other email thread where you described the race condition.
 
> > It is a trade off of using svq + pasid vs using something from the device.
> >
> > Again, both has different use case and value. One uses cpu and one uses
> device.
> 
> That's for sure.
> 
> > Depending how much power one wants to spend where..
> 
> But as the author of the proposal, you need to elaborate more on how you
> expect for a hypervisor instead of letting the reviewer guess.
A hypervisor would look in the spec for how to implement the functionality.
I will add a short description of the write records read command; the flow I have in mind is roughly the one below.
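A minimal sketch, assuming the start/read(query)/stop command semantics
of this series; the helper names and types here are made up:

static void track_writes_during_precopy(struct owner_dev *owner, u16 member_id,
                                        const struct addr_range *ranges,
                                        u32 nr_ranges)
{
        struct write_record recs[MAX_RECS];
        u32 n;

        /* start write recording for the guest memory ranges */
        virtio_admin_write_records_start(owner, member_id, ranges, nr_ranges);

        while (!precopy_converged()) {
                /* a read also clears the returned records on the device */
                n = virtio_admin_write_records_read(owner, member_id,
                                                    recs, MAX_RECS);
                merge_into_dirty_set(recs, n);
        }

        stop_member_device(owner, member_id);   /* device mode stop/freeze */

        /* final drain after the device is quiesced */
        n = virtio_admin_write_records_read(owner, member_id, recs, MAX_RECS);
        merge_into_dirty_set(recs, n);

        virtio_admin_write_records_stop(owner, member_id);
}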


* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  4:23                           ` [virtio-comment] " Jason Wang
@ 2023-11-16  5:29                             ` Parav Pandit
  2023-11-16  5:51                               ` [virtio-comment] " Michael S. Tsirkin
  2023-11-21  7:14                               ` Jason Wang
  0 siblings, 2 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-16  5:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 16, 2023 9:54 AM
> 
> On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:07 AM
> > >
> > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Tuesday, November 7, 2023 9:34 AM
> > > > >
> > > > > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Monday, November 6, 2023 12:04 PM
> > > > > > >
> > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > > > > >
> > > > > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > During a device migration flow (typically in a
> > > > > > > > > > > > > > precopy phase of the live migration), a device
> > > > > > > > > > > > > > may write to the guest memory. Some
> > > > > > > > > > > > > > iommu/hypervisor may not be able to track
> > > > > > > > > > > > > > these
> > > > > > > > > written pages.
> > > > > > > > > > > > > > These pages to be migrated from source to
> > > > > > > > > > > > > > destination
> > > > > hypervisor.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > A device which writes to these pages, provides
> > > > > > > > > > > > > > the page address record of the to the owner device.
> > > > > > > > > > > > > > The owner device starts write recording for
> > > > > > > > > > > > > > the device and queries all the page addresses
> > > > > > > > > > > > > > written by the
> > > device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Fixes:
> > > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/issue
> > > > > > > > > > > > > > s/17
> > > > > > > > > > > > > > 6
> > > > > > > > > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > > > > > > > > Signed-off-by: Satananda Burla
> > > > > > > > > > > > > > <sburla@marvell.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > changelog:
> > > > > > > > > > > > > > v1->v2:
> > > > > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >  admin-cmds-device-migration.tex | 15
> > > > > > > > > > > > > > +++++++++++++++
> > > > > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > b/admin-cmds-device-migration.tex index
> > > > > > > > > > > > > > ed911e4..2e32f2c
> > > > > > > > > > > > > > 100644
> > > > > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > > > > > > Migration}\label{sec:Basic Facilities of a
> > > > > > > > > > > > > > Virtio Device / The owner driver can discard
> > > > > > > > > > > > > > any partially read or written device context
> > > > > > > > > > > > > > when  any of the device migration flow
> > > > > > > > > > > > > should be aborted.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +During the device migration flow, a
> > > > > > > > > > > > > > +passthrough device may write data to the
> > > > > > > > > > > > > > +guest virtual machine's memory, a source
> > > > > > > > > > > > > > +hypervisor needs to keep track of these
> > > > > > > > > > > > > > +written memory to migrate such memory to
> > > > > > > > > > > > > > +destination
> > > > > > > > > > > > > hypervisor.
> > > > > > > > > > > > > > +Some systems may not be able to keep track of
> > > > > > > > > > > > > > +such memory write addresses at hypervisor level.
> > > > > > > > > > > > > > +In such a scenario, a device records and
> > > > > > > > > > > > > > +reports these written memory addresses to the
> > > > > > > > > > > > > > +owner device. The owner driver enables write
> > > > > > > > > > > > > > +recording for one or more physical address
> > > > > > > > > > > > > > +ranges per device during device
> > > > > > > > > migration flow.
> > > > > > > > > > > > > > +The owner driver periodically queries these
> > > > > > > > > > > > > > +written physical address
> > > > > > > > > > > records from the device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I wonder how PA works in this case. Device uses
> > > > > > > > > > > > > untranslated requests so it can only see IOVA.
> > > > > > > > > > > > > We can't mandate
> > > > > ATS anyhow.
> > > > > > > > > > > > Michael suggested to keep the language uniform as
> > > > > > > > > > > > PA as this is ultimately
> > > > > > > > > > > what the guest driver is supplying during vq
> > > > > > > > > > > creation and in posting buffers as physical address.
> > > > > > > > > > >
> > > > > > > > > > > This seems to need some work. And, can you show me
> > > > > > > > > > > how it can
> > > > > work?
> > > > > > > > > > >
> > > > > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor expected
> > > > > > > > > > > to do a bisection of the whole range?
> > > > > > > > > > > 2) does the device need to reserve sufficient
> > > > > > > > > > > internal resources for logging the dirty page and why (not)?
> > > > > > > > > > No when dirty page logging starts, only at that time,
> > > > > > > > > > device will reserve
> > > > > > > > > enough resources.
> > > > > > > > >
> > > > > > > > > GAW is 48bit, how large would it have then?
> > > > > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > > > > It is function of address ranges for the amount of guest
> > > > > > > > memory regardless of
> > > > > > > GAW.
> > > > > > >
> > > > > > > The problem is, e.g when vIOMMU is enabled, you can't know
> > > > > > > which IOVA is actually used by guests. And even for the case
> > > > > > > when vIOMMU is not enabled, the guest may have several TBs.
> > > > > > > Is it easy to reserve sufficient resources by the device itself?
> > > > > > >
> > > > > > When page tracking is enabled per device, it knows about the
> > > > > > range and it can
> > > > > reserve certain resource.
> > > > >
> > > > > I didn't see such an interface in this series. Anything I miss?
> > > > >
> > > > Yes, this patch and the next patch is covering the page tracking
> > > > start,stop and
> > > query commands.
> > > > They are named as write recording commands.
> > >
> > > So I still don't see how the device can reserve sufficient resources?
> > > Guests may map a very large area of memory to IOMMU (or when vIOMMU
> > > is disabled, GPA is used). It would be several TBs, how can the
> > > device reserve sufficient resources in this case?
> > When the map is established, the ranges are supplied to the device to know
> how much to reserve.
> > If device does not have enough resource, it fails the command.
> >
> > One can advance it further to provision for the desired range..
> 
> Well, I think I've asked whether or not a bisection is needed, and you told me
> not ...
> 
> But at least we need to document this in the proposal, no?
>
We should expose a device limit in the proposed WRITE_RECORD_CAP_QUERY command, stating how much address range it can track,
so that a future provisioning framework can use it.

I will cover this in v5 early next week.
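As an illustration only (the actual layout, field names and units will
be whatever v5 defines), the query result could look something like:

struct virtio_admin_cmd_write_records_cap_result {
        /* maximum total length of guest memory, in bytes, for which
         * the device can record writes */
        le64 max_total_track_length;
        /* maximum number of disjoint address ranges that can be
         * tracked at once */
        le32 max_track_ranges;
        le32 reserved;
};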
 
> > >
> > > >
> > > > > Btw, the IOVA is allocated by the guest actually, how can we
> > > > > know the
> > > range?
> > > > > (or using the host range?)
> > > > >
> > > > Hypervisor would have mapping translation.
> > >
> > > That's really tricky and can only work in some cases:
> > >
> > > 1) It requires the hypervisor to traverse the guest I/O page tables
> > > which could be very large range
> > > 2) It requests the hypervisor to trap the modification of guest I/O
> > > page tables and synchronize with the range changes, which is
> > > inefficient and can only be done when we are doing shadow PTEs. It
> > > won't work when the nesting translation could be offloaded to the
> > > hardware
> > > 3) It is racy with the guest modification of I/O page tables which
> > > is explained in another thread
> > Mapping changes with more hw mmu's is not a frequent event and IOTLB
> flush is done using querying the dirty log for the smaller range.
> >
> > > 4) No aware of new features like PASID which has been explained in
> > > another thread
> > For all the pinned work with non sw based IOMMU, it is typically small subset.
> > PASID is guest controlled.
> 
> Let's repeat my points:
> 
> 1) vq1 use untranslated request with PASID1
> 2) vq2 use untranslated request with PASID2
> 
> Shouldn't we log PASID as well?
> 
Possibly yes, either by requesting the tracking per PASID or by logging the PASID with each record.
When PASID-based VQs are supported in the future, this part should be extended, for example along the lines below.
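Purely as an illustration of such a future extension (none of this is
part of the current series; names and sizes are made up):

struct virtio_dev_write_record {
        le64 start_address;     /* start of the written range (PA/IOVA) */
        le64 length;            /* length of the written range in bytes */
};

/* possible PASID-aware variant for a later revision */
struct virtio_dev_write_record_pasid {
        le64 start_address;
        le64 length;
        le32 pasid;             /* address space the write belongs to */
        le32 flags;             /* e.g. translated vs. untranslated request */
};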

> And
> 
> 1) vq1 is using translated request
> 2) vq2 is using untranslated request
>
 
> How could we differ?
> 
> >
> > >
> > > >
> > > > > >
> > > > > > > Host should always have more resources than device, in that
> > > > > > > sense there could be several methods that tries to utilize
> > > > > > > host memory instead of the one in the device. I think we've
> > > > > > > discussed this when going through the doc prepared by Eugenio.
> > > > > > >
> > > > > > > >
> > > > > > > > > What happens if we're trying to migrate more than 1 device?
> > > > > > > > >
> > > > > > > > That is perfectly fine.
> > > > > > > > Each device is updating its log of pages it wrote.
> > > > > > > > The hypervisor is collecting their sum.
> > > > > > >
> > > > > > > See above.
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 3) DMA is part of the transport, it's natural to do
> > > > > > > > > > > logging there, why duplicate efforts in the virtio layer?
> > > > > > > > > > He he, you have funny comment.
> > > > > > > > > > When an abstract facility is added to virtio you say
> > > > > > > > > > to do in
> > > transport.
> > > > > > > > >
> > > > > > > > > So it's not done in the general facility but tied to the admin part.
> > > > > > > > > And we all know dirty page tracking is a challenge and
> > > > > > > > > Eugenio has a good summary of pros/cons. A revisit of
> > > > > > > > > those docs make me think virtio is not the good place
> > > > > > > > > for doing that for
> > > may reasons:
> > > > > > > > >
> > > > > > > > > 1) as stated, platform will evolve to be able to
> > > > > > > > > tracking dirty pages, actually, it has been supported by
> > > > > > > > > a lot of major IOMMU vendors
> > > > > > > >
> > > > > > > > This is optional facility in virtio.
> > > > > > > > Can you please point to the references? I don’t see it in
> > > > > > > > the common Linux
> > > > > > > kernel support for it.
> > > > > > >
> > > > > > > Note that when IOMMUFD is being proposed, dirty page
> > > > > > > tracking is one of the major considerations.
> > > > > > >
> > > > > > > This is one recent proposal:
> > > > > > >
> > > > > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > > > > >
> > > > > > Sure, so if platform supports it. it can be used from the platform.
> > > > > > If it does not, the device supplies it.
> > > > > >
> > > > > > > > Instead Linux kernel choose to extend to the devices.
> > > > > > >
> > > > > > > Well, as I stated, tracking dirty pages is challenging if
> > > > > > > you want to do it on a device, and you can't simply invent
> > > > > > > dirty page tracking for each type of the devices.
> > > > > > >
> > > > > > It is not invented.
> > > > > > It is generic framework for all virtio device types as proposed here.
> > > > > > Keep in mind, that it is optional already in v3 series.
> > > > > >
> > > > > > > > At least not seen to arrive this in any near term in start
> > > > > > > > of
> > > > > > > > 2024 which is
> > > > > > > where users must use this.
> > > > > > > >
> > > > > > > > > 2) you can't assume virtio is the only device that can
> > > > > > > > > be used by the guest, having dirty pages tracking to be
> > > > > > > > > implemented in each type of device is unrealistic
> > > > > > > > Of course, there is no such assumption made. Where did you
> > > > > > > > see a text that
> > > > > > > made such assumption?
> > > > > > >
> > > > > > > So what happens if you have a guest with virtio and other
> > > > > > > devices
> > > assigned?
> > > > > > >
> > > > > > What happens? Each device type would do its own dirty page tracking.
> > > > > > And if all devices does not have support, hypervisor knows to
> > > > > > fall back to
> > > > > platform iommu or its own.
> > > > > >
> > > > > > > > Each virtio and non virtio devices who wants to report
> > > > > > > > their dirty page report,
> > > > > > > will do their way.
> > > > > > > >
> > > > > > > > > 3) inventing it in the virtio layer will be deprecated
> > > > > > > > > in the future for sure, as platform will provide much
> > > > > > > > > rich features for logging e.g it can do it per PASID
> > > > > > > > > etc, I don't see any reason virtio need to compete with
> > > > > > > > > the features that will be provided by the platform
> > > > > > > > Can you bring the cpu vendors and committement to virtio
> > > > > > > > tc with timelines
> > > > > > > so that virtio TC can omit?
> > > > > > >
> > > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio
> > > > > > > needs to be built on top of transport or platform. There's
> > > > > > > no need to duplicate
> > > > > their job.
> > > > > > > Especially considering that virtio can't do better than them.
> > > > > > >
> > > > > > I wanted to see a strong commitment for the cpu vendors to
> > > > > > support dirty
> > > > > page tracking.
> > > > >
> > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD
> > > > > and ARM are all supporting that now.
> > > > >
> > > > > > And the work seems to have started for some platforms.
> > > > >
> > > > > Let me quote from the above link:
> > > > >
> > > > > """
> > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > alongside VT-D rev3.x also do support.
> > > > > """
> > > > >
> > > > > > Without such platform commitment, virtio also skipping it would not
> work.
> > > > >
> > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > vtd, the hw feature has been there for years.
> > > > >
> > > > Vtd has a sticky D bit that requires synchronization with IOPTE
> > > > page caches
> > > when sw wants to clear it.
> > >
> > > This is by design.
> > >
> > > > Do you know if is it reliable when device does multiple writes,
> > > > ie,
> > > >
> > > > a. iommu write D bit
> > > > b. software read it
> > > > c. sw synchronize cache
> > > > d. iommu write D bit on next write by device
> > >
> > > What issue did you see here? But that's not even an excuse, if
> > > there's a bug, let's report it to IOMMU vendors and let them fix it.
> > > The thread I point to you is actually a good space.
> > >
> > So we cannot claim that it is there in the platform.
> 
> I'm confused, the thread I point to you did the cache synchronization which has
> been explained in the changelog, so what's the issue?
>
If the ask is for the IOMMU chip to fix something, we cannot claim that dirty page tracking is already available in the platform.
 
> >
> > > Again, the point is to let the correct role play.
> > >
> > How many more years should we block the virtio device migration when
> platform do not have it?
> 
> At least for VT-D, it has been used for years.
Are these device-written pages tracked by KVM as a dirty page log for VT-d, instead of through VFIO?

> 
> >
> > > >
> > > > ARM SMMU based servers to be present with D bit tracking.
> > > > It is still early to say platform is ready.
> > >
> > > This is not what I read from both the series I posted and the spec,
> > > dirty bit has been supported several years ago at least for vtd.
> > Supported, but spec listed it as sticky bit that may require special handling.
> 
> Please explain why this is "special handling". IOMMU has several different layers
> of caching, by design, it can't just open a window for D bit.
> 
> > May be it is working, but not all cpu platforms have it.
> 
> I don't see the point. Migration is not supported for virito as well.
>
I don't see a point in discussing it either.

I already acknowledged that the platform may have support as well, and that not all platforms have it.
So the device feeds the data, and it is the platform's choice to enable/disable it.
 
> >
> > >
> > > >
> > > > It is optional so whichever has the support it will be used.
> > >
> > > I can't see the point of this, it is already available. And
> > > migration doesn't exist in virtio spec yet.
> > >
> > > >
> > > > > >
> > > > > > > > i.e. in first year of 2024?
> > > > > > >
> > > > > > > Why does it matter in 2024?
> > > > > > Because users needs to use it now.
> > > > > >
> > > > > > >
> > > > > > > > If not, we are better off to offer this, and when/if
> > > > > > > > platform support is, sure,
> > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > >
> > > > > > > > > 4) if the platform support is missing, we can use
> > > > > > > > > software or leverage transport for assistance like PRI
> > > > > > > > All of these are in theory.
> > > > > > > > Our experiment shows PRI performance is 21x slower than
> > > > > > > > page fault rate
> > > > > > > done by the cpu.
> > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > >
> > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > Do you have perf data for this?
> > > > >
> > > > > No, but it's not hard to imagine the worst case. Wrote a small
> > > > > program that dirty every page by a NIC.
> > > > >
> > > > > > In the internal tests we don’t see this happening.
> > > > >
> > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > >
> > > > > So if we get very high dirty rates (e.g by a high speed NIC), we
> > > > > can't satisfy the requirement of the downtime. Or if you see the
> > > > > converge, you might get help from the auto converge support by
> > > > > the hypervisors like KVM where it tries to throttle the VCPU
> > > > > then you can't reach
> > > the wire speed.
> > > > >
> > > > Once PRI is enabled, even without migration, there is basic perf issues.
> > >
> > > The context is not PRI here...
> > >
> > > It's about if you can stick to wire speed during live migration.
> > > Based on the analysis so far, you can't achieve wirespeed and downtime at
> the same time.
> > > That's why the hypervisor needs to throttle VCPU or devices.
> > >
> > So?
> > Device also may throttle itself.
> 
> That's perfectly fine. We are on the same page, no? It's wrong to judge the dirty
> page tracking in the context of live migration by measuring whether or not the
> device can work at wire speed.
> 
> >
> > > For PRI, it really depends on how you want to use it. E.g if you
> > > don't want to pin a page, the performance is the price you must pay.
> > PRI without pinning does not make sense for device to make large mapping
> queries.
> 
> That's also fine. Hypervisors can choose to enable and use PRI depending on
> the different cases.
>
So PRI is not a must for device migration.
Device migration must be able to work without PRI enabled; as simple as that, as the first baseline.
 
> >
> > >
> > > >
> > > > > >
> > > > > > >
> > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > So it is unusable.
> > > > > > >
> > > > > > > It's not about mandating, it's about doing things in the
> > > > > > > correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > You should try.
> > > > >
> > > > > Not my duty, I just want to make sure things are done in the
> > > > > correct layer, and once it needs to be done in the virtio,
> > > > > there's nothing obviously
> > > wrong.
> > > > >
> > > > At present, it looks all platforms are not equally ready for page tracking.
> > >
> > > That's not an excuse to let virtio support that.
> > It is wrong attribution as excuse.
> >
> > > And we need also to figure out if
> > > virtio can do that easily. I've pointed out sufficient issues, I'm
> > > pretty sure there would be more as the platform evolves.
> > >
> > I am not sure if virtio feeds the log into the platform.
> 
> I don't understand the meaning here.
> 
I mistakenly merged two sentences.

Virtio feeds the dirty page details to the hypervisor, which collects and merges the page records.
So it is the platform's choice whether to use IOMMU-based tracking or device-based tracking; a rough sketch of that merge is below.
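A minimal sketch of that consolidation, assuming a hypervisor-side
dirty bitmap; the names here are hypothetical, not QEMU/KVM/VFIO APIs:

static void merge_device_write_records(struct dirty_bitmap *bm,
                                       const struct write_record *recs,
                                       size_t n)
{
        size_t i;
        uint64_t pfn;

        for (i = 0; i < n; i++) {
                uint64_t first = recs[i].start >> PAGE_SHIFT;
                uint64_t last  = (recs[i].start + recs[i].len - 1) >> PAGE_SHIFT;

                /* same bitmap an IOMMU-based tracking path would feed */
                for (pfn = first; pfn <= last; pfn++)
                        set_bit(pfn, bm->bits);
        }
}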

> >
> > > >
> > > > > > In the current state, it is mandating.
> > > > > > And if you think PRI is the only way,
> > > > >
> > > > > I don't, it's just an example where virtio can leverage from
> > > > > either transport or platform. Or if it's the fault in virtio
> > > > > that slows down the PRI, then it is something we can do.
> > > > >
> > > > Yea, it does not seem to be ready yet.
> > > >
> > > > > >  than you should propose that in the dirty page tracking
> > > > > > series that you listed
> > > > > above to not do dirty page tracking. Rather depend on PRI, right?
> > > > >
> > > > > No, the point is to not duplicate works especially considering
> > > > > virtio can't do better than platform or transport.
> > > > >
> > > > Both the platform and virtio work is ongoing.
> > >
> > > Why duplicate the work then?
> > >
> > Not all cpu platforms support as far as I know.
> 
> Yes, but we all know the platform is working to support this.
> 
> Supporting this on the device is hard.
>
This is optional; whichever device would like to implement it will support it.
 
> >
> > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > When one does something in transport, you say, this is
> > > > > > > > > > transport specific, do
> > > > > > > > > some generic.
> > > > > > > > > >
> > > > > > > > > > Here the device is being tracked is virtio device.
> > > > > > > > > > PCI-SIG has told already that PCIM interface is
> > > > > > > > > > outside the scope of
> > > it.
> > > > > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > > > > >
> > > > > > > > > You will end up with a competition with the
> > > > > > > > > platform/transport one that will fail.
> > > > > > > > >
> > > > > > > > I don’t see a reason. There is no competition.
> > > > > > > > Platform always have a choice to not use device side page
> > > > > > > > tracking when it is
> > > > > > > supported.
> > > > > > >
> > > > > > > Platform provides a lot of other functionalities for dirty logging:
> > > > > > > e.g per PASID, granular, etc. So you want to duplicate them
> > > > > > > again in the virtio? If not, why choose this way?
> > > > > > >
> > > > > > It is optional for the platforms where platform do not have it.
> > > > >
> > > > > We are developing new virtio functionalities that are targeted
> > > > > for future platforms. Otherwise we would end up with a feature
> > > > > with a very narrow use case.
> > > > In general I agree that platform is an option too.
> > > > Hypervisor will be able to make the decision to use platform when
> > > > available
> > > and fallback to device method when platform does not have it.
> > > >
> > > > Future and to be equally usable in near term :)
> > >
> > > Please don't double standard again:
> > >
> > > When you are talking about TDISP, you want virtio to be designed to
> > > fit for the future where the platform is ready in the future When
> > > you are talking about dirty tracking, you want it to work now even
> > > if
> > >
> > The proposal of transport VQ is anti-TDISP.
> 
> It's nothing about the transport VQ; it's about what you said regarding the
> adminq-based device context. There was a comment pointing out that the current
> TDISP spec forbids modifying device state while a TVM is attached. Then you
> told us TDISP may evolve to allow that.
So? That is not a double standard.
The proposal is based on the main principle of not depending on hypervisor trapping and emulation, which is also the baseline of TDISP.

> 
> > The proposal of dirty tracking is not anti-platform. It is optional like rest of the
> platform.
> >
> > > 1) most of the platform is ready now
> > Can you list a ARM server CPU in production that has it? (not in some pdf
> spec).
> 
> Then, in the context of dirty pages, I've shown you that dirty page tracking is
> already supported by all major vendors.
A major IP vendor != a major CPU chip vendor.
I don't agree with the proof.

I already acknowledged that I have seen an internal test report for dirty tracking with one CPU and NIC.

I just don't see that all CPUs have support for it.
Hence, this optional feature.

> Yet you refuse to apply the same standard you used
> when explaining the adminq for device context in TDISP.
> 
> So I didn't ask you the ETA of the TDISP support for migration or adminq, but
> you want me to give you the production information which is pointless. 
Because you keep claiming that _all_ CPUs in the world have support for efficient dirty page tracking.

> You might need to ask ARM to get an answer, but a simple Google search told
> me that the effort to support dirty page tracking in SMMUv3 goes back to
> early 2021.
>
To my knowledge ARM does not produce physical chips.
Your proposal would keep those ARM server vendors from using virtio devices.
That does not make sense to me.
 
> https://lore.kernel.org/linux-iommu/56b001fa-b4fe-c595-dc5e-
> f362d2f07a19@linux.intel.com/t/
> 
> Why is it not merged? Simply because we agreed to do it at the IOMMUFD
> layer, so it needs to wait.
> 
> Thanks
> 
> 
> >
> > > 2) whether or not virtio can log dirty pages correctly is still
> > > questionable
> > >
> > > Thanks
> >
> > There is no double standard. The feature is optional and co-exists with the
> > platform method, as explained above.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  5:29                             ` [virtio-comment] " Parav Pandit
@ 2023-11-16  5:51                               ` Michael S. Tsirkin
  2023-11-16  7:35                                 ` Michael S. Tsirkin
                                                   ` (2 more replies)
  2023-11-21  7:14                               ` Jason Wang
  1 sibling, 3 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16  5:51 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> We should expose a limit of the device in the proposed WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> So that future provisioning framework can use it.
> 
> I will cover this in v5 early next week.

I do worry about how this can even work though. If you want a generic 
device you do not get to dictate how much memory VM has.

Aren't we talking bit per page? With 1TByte of memory to track ->
256Gbit -> 32Gbit -> 8Gbyte per VF?

And you happily say "we'll address this in the future" while at the same
time fighting tooth and nail against adding single bit status registers
because scalability?


I have a feeling doing this completely theoretical like this is problematic.
Maybe you have it all laid out neatly in your head but I suspect
not all of TC can picture it clearly enough based just on spec text.

We do sometimes ask for POC implementation in linux / qemu to
demonstrate how things work before merging code. We skipped this
for admin things so far but I think it's a good idea to start doing
it here.

What makes me pause a bit before saying please do a PoC is
all the opposition that seems to exist to even using admin
commands in the 1st place. I think once we finally stop
arguing about whether to use admin commands at all then
a PoC will be needed before merging.


-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  4:24                                     ` [virtio-comment] " Jason Wang
@ 2023-11-16  6:49                                       ` Michael S. Tsirkin
  2023-11-21  4:21                                         ` Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16  6:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 12:24:27PM +0800, Jason Wang wrote:
> On Thu, Nov 16, 2023 at 1:37 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:11 AM
> > >
> > > On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > > Hi Michael,
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, November 9, 2023 1:29 PM
> > > >
> > > > [..]
> > > > > > Besides the issue of performance, it's also racy, assuming we are
> > > > > > logging
> > > > > IOVA.
> > > > > >
> > > > > > 0) device log IOVA
> > > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > > 2) guest map IOVA to a new GPA
> > > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > > >
> > > > > > Then we lost the old GPA.
> > > > >
> > > > > Interesting and a good point. And by the way e.g. vhost has the same
> > > > > issue.  You need to flush dirty tracking info when changing the
> > > > > mappings somehow.  Parav what's the plan for this? Should be addressed in
> > > the spec too.
> > > > >
> > > > As you listed the flush is needed for vhost or device-based DPT.
> > >
> > > What does DPT mean? Device Page Table? Let's not invent terminology which is
> > > not known by others please.
> > >
> > Sorry for using the acronym. I meant dirty page tracking.
> >
> > > We have discussed it many times. You can't just depend on ATS or reinventing
> > > wheels in virtio.
> > The dependency is on the iommu which would have the mapping of GIOVA to GPA like any sw implementation.
> > No dependency on ATS.
> >
> > >
> > > What's more, please try not to give me the impression that the proposal is
> > > optimized for a specific vendor (like device IOMMU stuffs).
> > >
> > You should stop calling this a vendor-specific thing.
> 
> Well, as you have explained, the confusion came from "DPT" ...
> 
> > One can equally say that the suspend bit proposal is for the SW vendor's device, which forces the virtio HW device to implement only IO queues + PASID + a non-unified interface for PF, VF and SIOVs + non-TDISP based devices.
> >
> > > > The necessary plumbing is already covered for this in the query (read and
> > > clear) command of this v3 proposal.
> > >
> > > The issue is logging via IOVA ... I don't see how "read and clear" can help.
> > >
> > Read and clear helps ensure that all the dirty pages are reported, hence there is no mapping/unmapping race.
> 
> Reported as IOVA ...
> 
> > As everything is reported.
> >
> > > > It is listed in Device Write Records Read Command.
> > >
> > > Please explain how your proposal can solve the above race.
> > >
> > In below manner.
> > 1. guest has GIOVA to GPA_1 mapping
> > 2. RX packets occurred to GIOVA
> > 3. device reported dirty page log for GIOVA (hypervisor is yet to read)
> > 4. guest requested mapping change from GIOVA to GPA_2
> > 4.1 During this, the IOTLB is invalidated and the dirty page report is queried, ensuring it is safe to change the mapping
> 
> It requires
> 
> 1) hypervisor traps IOTLB invalidation, which doesn't work when
> nesting could be offloaded (IOMMUFD has started the work to support
> nesting)
> 2) query the device about the dirty page on each IOTLB invalidation which:
> 2.1) A huge round trip: guest IOTLB invalidation -> trapped by
> hypervisor -> start the query from the device -> device return ->
> hypervisor reports IOTLB invalidation is done -> let guest run. Have
> you benchmarked the RTT in this case? There are just too many places
> that cause the delay in the middle.

To be fair, invalidations are already expensive, e.g. with the vhost IOTLB
they require a slow system call.
This will make them *even more* expensive.

A problem for some but not all workloads.  Again, I agree the motivation,
tradeoffs and a comparison with both dirty tracking by the IOMMU and the
shadow vq approach really should be included.


> 2.2) Guest-triggerable behaviour: a malicious guest can simply do
> endless IOTLB invalidations to DoS e.g. the admin virtqueue

I'm not sure how much to worry about it - just don't allow more
than one in flight per VM.
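
To make the flow being debated above concrete, here is a minimal sketch of
hypervisor-side handling of a guest IOTLB invalidation while device write
recording is active. All helper names (admin_write_records_read_and_clear,
iova_to_gpa, mark_gpa_range_dirty, viommu_complete_invalidation) are
hypothetical placeholders, not anything defined by this proposal:

#include <stdint.h>
#include <stddef.h>

struct dirty_record { uint64_t iova; uint64_t len; };

/* Placeholders for functionality this sketch assumes exists. */
extern int admin_write_records_read_and_clear(int member_id,
					      struct dirty_record *recs,
					      size_t *count);
extern uint64_t iova_to_gpa(uint64_t iova);          /* old GIOVA->GPA map */
extern void mark_gpa_range_dirty(uint64_t gpa, uint64_t len);
extern void viommu_complete_invalidation(uint64_t giova);

static void handle_guest_iotlb_invalidate(int member_id, uint64_t giova)
{
	struct dirty_record recs[64];
	size_t n = sizeof(recs) / sizeof(recs[0]);

	/* 1. Drain the device's write records *before* the GIOVA->GPA mapping
	 *    changes, so dirtiness is still attributed to the old GPA
	 *    (the "read and clear" step discussed above). */
	if (admin_write_records_read_and_clear(member_id, recs, &n) == 0) {
		for (size_t i = 0; i < n; i++)
			mark_gpa_range_dirty(iova_to_gpa(recs[i].iova),
					     recs[i].len);
	}

	/* 2. Only now let the invalidation/remap complete on the vIOMMU. */
	viommu_complete_invalidation(giova);
}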



> >
> > > >
> > > > When the page write record is fully read, it is flushed.
> > > > How/when to use it is, I think, hypervisor specific, so we are probably
> > > > better off not documenting those details.
> > >
> > > Well, as the author of this proposal, at least you need to know how a hypervisor
> > > can work with your proposal, no?
> > >
> > Likely yes, but it is not the scope of the spec to list those paths etc.
> 
> Fine, but as a reviewer I need to know if it can work with a hypervisor well.
> 
> >
> > > > May be such a read is needed in some other path too, depending on how
> > > > the hypervisor is implemented.
> > >
> > > What do you mean by "May be ... some other path" here? You're inventing a
> > > mechanism that you don't know how a hypervisor can use?
> >
> > No. I meant the hypervisor may have more operations than map/unmap/flush where it may need to implement this.
> > Someone may call it set_map(), someone may say dma_map()...
> 
> Ok.
> 
> Thanks




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15 17:37                                   ` [virtio-comment] " Parav Pandit
  2023-11-16  4:24                                     ` [virtio-comment] " Jason Wang
@ 2023-11-16  6:50                                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16  6:50 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 15, 2023 at 05:37:46PM +0000, Parav Pandit wrote:
> > What does DPT mean? Device Page Table? Let's not invent terminology which is
> > not known by others please.
> >
> Sorry for using the acronym. I meant dirty page tracking.

:)
Yeah, don't make up new ones please.

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  5:51                               ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-16  7:35                                 ` Michael S. Tsirkin
  2023-11-16  7:40                                   ` [virtio-comment] " Parav Pandit
  2023-11-16 10:28                                 ` Zhu, Lingshan
  2023-11-21  4:23                                 ` Jason Wang
  2 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16  7:35 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > We should expose a limit of the device in the proposed WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > So that future provisioning framework can use it.
> > 
> > I will cover this in v5 early next week.
> 
> I do worry about how this can even work though. If you want a generic 
> device you do not get to dictate how much memory VM has.
> 
> Aren't we talking bit per page? With 1TByte of memory to track ->
> 256Gbit -> 32Gbit -> 8Gbyte per VF?

Ugh. Actually of course:
With 1TByte of memory to track -> 256Mbit -> 32Mbit -> 8Mbyte per VF

8Gbyte per *PF* with 1K VFs.
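
For reference, a back-of-the-envelope sketch of the bit-per-page arithmetic
(only a sketch; the per-VF figure scales with the tracking granularity, and a
device need not use a plain bitmap at all, as discussed later in the thread):

#include <stdio.h>
#include <stdint.h>

/* One tracking bit per page of guest memory. */
static uint64_t bitmap_bytes(uint64_t mem_bytes, uint64_t page_bytes)
{
	uint64_t bits = mem_bytes / page_bytes;
	return bits / 8;
}

int main(void)
{
	const uint64_t one_tib = 1ULL << 40;

	/* 4 KiB granularity: 2^28 bits -> 32 MiB per VF. */
	printf("4K pages:  %llu MiB per VF\n",
	       (unsigned long long)(bitmap_bytes(one_tib, 4096) >> 20));
	/* A coarser 16 KiB granularity shrinks this 4x, to 8 MiB per VF. */
	printf("16K pages: %llu MiB per VF\n",
	       (unsigned long long)(bitmap_bytes(one_tib, 16384) >> 20));
	return 0;
}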



> And you happily say "we'll address this in the future" while at the same
> time fighting tooth and nail against adding single bit status registers
> because scalability?
> 
> 
> I have a feeling doing this completely theoretical like this is problematic.
> Maybe you have it all laid out neatly in your head but I suspect
> not all of TC can picture it clearly enough based just on spec text.
> 
> We do sometimes ask for POC implementation in linux / qemu to
> demonstrate how things work before merging code. We skipped this
> for admin things so far but I think it's a good idea to start doing
> it here.
> 
> What makes me pause a bit before saying please do a PoC is
> all the opposition that seems to exist to even using admin
> commands in the 1st place. I think once we finally stop
> arguing about whether to use admin commands at all then
> a PoC will be needed before merging.
> 
> 
> -- 
> MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  7:35                                 ` Michael S. Tsirkin
@ 2023-11-16  7:40                                   ` Parav Pandit
  2023-11-16 11:48                                     ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-16  7:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 16, 2023 1:06 PM
> 
> On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > We should expose a limit of the device in the proposed
> WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > So that future provisioning framework can use it.
> > >
> > > I will cover this in v5 early next week.
> >
> > I do worry about how this can even work though. If you want a generic
> > device you do not get to dictate how much memory VM has.
> >
> > Aren't we talking bit per page? With 1TByte of memory to track ->
> > 256Gbit -> 32Gbit -> 8Gbyte per VF?
> 
> Ugh. Actually of course:
> With 1TByte of memory to track -> 256Mbit -> 32Mbit -> 8Mbyte per VF
> 
> 8Gbyte per *PF* with 1K VFs.
> 
Device may not maintain as a bitmap.
I have a hard time using a calculator for 1TB * 1K of DDR memory. :)



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15 11:52                                                 ` Michael S. Tsirkin
@ 2023-11-16  9:38                                                   ` Zhu, Lingshan
  2023-11-16 12:18                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-16  9:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 11/15/2023 7:52 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 15, 2023 at 04:42:56PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/15/2023 4:05 PM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 15, 2023 at 03:59:16PM +0800, Zhu, Lingshan wrote:
>>>> On 11/15/2023 3:51 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
>>>>>> On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
>>>>>>> On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
>>>>>>>>>> So I can't
>>>>>>>>>> believe it has good performance overall. Logging via IOMMU or using
>>>>>>>>>> shadow virtqueue doesn't need any extra PCI transactions at least.
>>>>>>>>> On the other hand they have an extra CPU cost.  Personally if this is
>>>>>>>>> coming from a hardware vendor, I am inclined to trust them wrt PCI
>>>>>>>>> transactions.  But anyway, discussing this at a high level theoretically
>>>>>>>>> is pointless - whoever bothers with actual prototyping for performance
>>>>>>>>> testing wins, if no one does I'd expect a back of a napkin estimate
>>>>>>>>> to be included.
>>>>>>>> if so, Intel has released productions implementing these interfaces years
>>>>>>>> ago,
>>>>>>>> see live migration in 4.1. IFCVF vDPA Implementation,
>>>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
>>>>>>>> and
>>>>>>> That one is based on shadow queue, right? Which I think this shows
>>>>>>> is worth supporting.
>>>>>> Yes, it is shadow virtqueue, I assume this is already mostly done,
>>>>>> do you see any gaps we need to address in our series that we should work on?
>>>>>>
>>>>>> Thanks
>>>>> There were a ton of comments posted on your series.
>>>> Hope I didn't miss anything. I see your latest comments are about vq states,
>>>> as replied before, I think we can record the states by two le16 and the
>>>> in-flight
>>>> descriptor tracking facility.
>>> I don't know why you need the le16. in-flight tracking should be enough.
>>> And given it needs DMA I would try really hard to actually use
>>> admin commands for this.
>> we need to record the on-device avail_idx and used_idx, or
>> how can the destination side know the device internal values.
> Again you never documented what state the device is in so I can't really
> say for sure.  But generally whenever a buffer is used the internal
> values are written out to memory.
This is the state of a virtqueue, in my series I have defined what is
vq state in [PATCH V2 1/6] virtio: introduce virtqueue state,
and give an example of PCI implementation.

In short, for split vq it is last_avail_idx and in-flight descriptors.
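
As a rough illustration only (illustrative field names, not the layout from
the referenced patch), such a split-virtqueue state record could look like:

#include <stdint.h>

struct sketch_split_vq_state {
	uint16_t last_avail_idx;  /* next avail ring entry the device will read */
	uint16_t used_idx;        /* device's internal used index               */
	/* Descriptors fetched from the avail ring but not yet marked used;
	 * the destination must replay or complete these. */
	uint16_t num_inflight;
	uint16_t inflight_head[]; /* in-flight descriptor head ids */
};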

I humbly request an explicit list of missing gaps, so that I can improve 
my V3

Thanks
>
>>>> For this shadow virtqueue, do you think I should address this in my V4?
>>>> Like saying: acknowledged control commands through the control virtqueue
>>>> should be recorded, and we want to use shadow virtqueue to track dirty
>>>> pages.
>>> What you need to do is actually describe what do you expect the device
>>> to do when it enters this suspend state. since you mention control
>>> virtqueue then it seems that there needs to be a device type
>>> specific text explaining the behaviour. If so this implies there
>>> needs to be a list of device types that support suspend
>>> until someone looks at each type and documents what it does.
>> On a second thought, shadow vqs are hypervisor behaviors, maybe should not
>> be
>> described in this device spec.
>>
>> Since SUSPEND is in device status, so for now I see every type of device
>> implements
>> device_status should support SUSPEND. This should be a general facility.
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  5:51                               ` [virtio-comment] " Michael S. Tsirkin
  2023-11-16  7:35                                 ` Michael S. Tsirkin
@ 2023-11-16 10:28                                 ` Zhu, Lingshan
  2023-11-16 11:59                                   ` Michael S. Tsirkin
  2023-11-21  4:23                                 ` Jason Wang
  2 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-16 10:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
>> We should expose a limit of the device in the proposed WRITE_RECORD_CAP_QUERY command, that how much range it can track.
>> So that future provisioning framework can use it.
>>
>> I will cover this in v5 early next week.
> I do worry about how this can even work though. If you want a generic
> device you do not get to dictate how much memory VM has.
>
> Aren't we talking bit per page? With 1TByte of memory to track ->
> 256Gbit -> 32Gbit -> 8Gbyte per VF?
>
> And you happily say "we'll address this in the future" while at the same
> time fighting tooth and nail against adding single bit status registers
> because scalability?
>
>
> I have a feeling doing this completely theoretical like this is problematic.
> Maybe you have it all laid out neatly in your head but I suspect
> not all of TC can picture it clearly enough based just on spec text.
>
> We do sometimes ask for POC implementation in linux / qemu to
> demonstrate how things work before merging code. We skipped this
> for admin things so far but I think it's a good idea to start doing
> it here.
>
> What makes me pause a bit before saying please do a PoC is
> all the opposition that seems to exist to even using admin
> commands in the 1st place. I think once we finally stop
> arguing about whether to use admin commands at all then
> a PoC will be needed before merging.
We have POR products that implement the approach in my series. They are
multiple generations of products in the market, running in customers' data
centers for years.

Back in 2019 when we started working on vDPA, we sent out some product
samples (e.g., Cascade Glacier) and the datasheet; you can find live
migration facilities there, including suspend, vq state and other features.

And there is a reference in DPDK live migration; I have provided this page
before: https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html. It has been
working for a long, long time.

So if we let the facts speak, if we want to see whether the proposal is
proven to work, I would say: they have been POR for years, and customers
have already deployed them for years.

For dirty page tracking, I see you want both platform IOMMU tracking and
shadow vqs; I am totally fine with this idea. And I think maybe we should
merge the basic features first, with dirty page tracking as the second step.

Thanks
>
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  7:40                                   ` [virtio-comment] " Parav Pandit
@ 2023-11-16 11:48                                     ` Michael S. Tsirkin
  2023-11-16 16:26                                       ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16 11:48 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 1:06 PM
> > 
> > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > We should expose a limit of the device in the proposed
> > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > > So that future provisioning framework can use it.
> > > >
> > > > I will cover this in v5 early next week.
> > >
> > > I do worry about how this can even work though. If you want a generic
> > > device you do not get to dictate how much memory VM has.
> > >
> > > Aren't we talking bit per page? With 1TByte of memory to track ->
> > > 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > 
> > Ugh. Actually of course:
> > With 1TByte of memory to track -> 256Mbit -> 32Mbit -> 8Mbyte per VF
> > 
> > 8Gbyte per *PF* with 1K VFs.
> > 
> Device may not maintain as a bitmap.

However you maintain it, there's 256Mega bit of information.

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16 10:28                                 ` Zhu, Lingshan
@ 2023-11-16 11:59                                   ` Michael S. Tsirkin
  2023-11-17  9:59                                     ` Zhu, Lingshan
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16 11:59 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > We should expose a limit of the device in the proposed WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > So that future provisioning framework can use it.
> > > 
> > > I will cover this in v5 early next week.
> > I do worry about how this can even work though. If you want a generic
> > device you do not get to dictate how much memory VM has.
> > 
> > Aren't we talking bit per page? With 1TByte of memory to track ->
> > 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > 
> > And you happily say "we'll address this in the future" while at the same
> > time fighting tooth and nail against adding single bit status registers
> > because scalability?
> > 
> > 
> > I have a feeling doing this completely theoretical like this is problematic.
> > Maybe you have it all laid out neatly in your head but I suspect
> > not all of TC can picture it clearly enough based just on spec text.
> > 
> > We do sometimes ask for POC implementation in linux / qemu to
> > demonstrate how things work before merging code. We skipped this
> > for admin things so far but I think it's a good idea to start doing
> > it here.
> > 
> > What makes me pause a bit before saying please do a PoC is
> > all the opposition that seems to exist to even using admin
> > commands in the 1st place. I think once we finally stop
> > arguing about whether to use admin commands at all then
> > a PoC will be needed before merging.
> We have POR products that implement the approach in my series. They are
> multiple generations of products in the market, running in customers' data
> centers for years.
> 
> Back in 2019 when we started working on vDPA, we sent out some product
> samples (e.g., Cascade Glacier) and the datasheet; you can find live
> migration facilities there, including suspend, vq state and other features.
> 
> And there is a reference in DPDK live migration; I have provided this page
> before: https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html. It has been
> working for a long, long time.
> 
> So if we let the facts speak, if we want to see whether the proposal is
> proven to work, I would say: they have been POR for years, and customers
> have already deployed them for years.

And I guess what you are trying to say is that this patchset
we are reviewing here should be held to the same standard and
there should be a PoC? Sounds reasonable.

> For dirty page tracking, I see you want both platform IOMMU tracking and
> shadow vqs; I am totally fine with this idea. And I think maybe we should
> merge the basic features first, with dirty page tracking as the second step.
> 
> Thanks

Parav wants to add an option of on-device tracking, which also seems
fine. I think it should be optional though, because the shadow and IOMMU
options exist.

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  9:38                                                   ` Zhu, Lingshan
@ 2023-11-16 12:18                                                     ` Michael S. Tsirkin
  2023-11-17  9:50                                                       ` Zhu, Lingshan
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16 12:18 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 05:38:35PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/15/2023 7:52 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 15, 2023 at 04:42:56PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/15/2023 4:05 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Nov 15, 2023 at 03:59:16PM +0800, Zhu, Lingshan wrote:
> > > > > On 11/15/2023 3:51 PM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
> > > > > > > On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
> > > > > > > > > > > So I can't
> > > > > > > > > > > believe it has good performance overall. Logging via IOMMU or using
> > > > > > > > > > > shadow virtqueue doesn't need any extra PCI transactions at least.
> > > > > > > > > > On the other hand they have an extra CPU cost.  Personally if this is
> > > > > > > > > > coming from a hardware vendor, I am inclined to trust them wrt PCI
> > > > > > > > > > transactions.  But anyway, discussing this at a high level theoretically
> > > > > > > > > > is pointless - whoever bothers with actual prototyping for performance
> > > > > > > > > > testing wins, if no one does I'd expect a back of a napkin estimate
> > > > > > > > > > to be included.
> > > > > > > > > if so, Intel has released productions implementing these interfaces years
> > > > > > > > > ago,
> > > > > > > > > see live migration in 4.1. IFCVF vDPA Implementation,
> > > > > > > > > https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> > > > > > > > > and
> > > > > > > > That one is based on shadow queue, right? Which I think this shows
> > > > > > > > is worth supporting.
> > > > > > > Yes, it is shadow virtqueue, I assume this is already mostly done,
> > > > > > > do you see any gaps we need to address in our series that we should work on?
> > > > > > > 
> > > > > > > Thanks
> > > > > > There were a ton of comments posted on your series.
> > > > > Hope I didn't miss anything. I see your latest comments are about vq states,
> > > > > as replied before, I think we can record the states by two le16 and the
> > > > > in-flight
> > > > > descriptor tracking facility.
> > > > I don't know why you need the le16. in-flight tracking should be enough.
> > > > And given it needs DMA I would try really hard to actually use
> > > > admin commands for this.
> > > we need to record the on-device avail_idx and used_idx, or
> > > how can the destination side know the device internal values.
> > Again you never documented what state the device is in so I can't really
> > say for sure.  But generally whenever a buffer is used the internal
> > values are written out to memory.
> This is the state of a virtqueue, in my series I have defined what is
> vq state in [PATCH V2 1/6] virtio: introduce virtqueue state,
> and give an example of PCI implementation.
> 
> In short, for split vq it is last_avail_idx and in-flight descriptors.
> 
> I humbly request an explicit list of missing gaps, so that I can improve my
> V3
> 
> Thanks

I don't know how to help you without resorting to writing it instead of
you; I sent 3 messages in response to that one patch alone. Your patch
just adds some bits here and there without much in the way of
documentation. The patch needs to explain what these things are and how
they interact with the VQ state in memory.


But besides, Parav needs to do exactly the same too. So why don't you
let Parav do the work on this and then later just add a small interface
to send admin commands through the VF itself? That looks like it will be
good enough for vDPA. Meanwhile I feel your energy would be better spent
working on the transport vq, which no one else is working on.




> > 
> > > > > For this shadow virtqueue, do you think I should address this in my V4?
> > > > > Like saying: acknowledged control commands through the control virtqueue
> > > > > should be recorded, and we want to use shadow virtqueue to track dirty
> > > > > pages.
> > > > What you need to do is actually describe what do you expect the device
> > > > to do when it enters this suspend state. since you mention control
> > > > virtqueue then it seems that there needs to be a device type
> > > > specific text explaining the behaviour. If so this implies there
> > > > needs to be a list of device types that support suspend
> > > > until someone looks at each type and documents what it does.
> > > On a second thought, shadow vqs are hypervisor behaviors, maybe should not
> > > be
> > > described in this device spec.
> > > 
> > > Since SUSPEND is in device status, so for now I see every type of device
> > > implements
> > > device_status should support SUSPEND. This should be a general facility.
> > 




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16 11:48                                     ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-16 16:26                                       ` Parav Pandit
  2023-11-16 17:25                                         ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-16 16:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 16, 2023 5:18 PM
> 
> On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 16, 2023 1:06 PM
> > >
> > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > We should expose a limit of the device in the proposed
> > > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > > > So that future provisioning framework can use it.
> > > > >
> > > > > I will cover this in v5 early next week.
> > > >
> > > > I do worry about how this can even work though. If you want a
> > > > generic device you do not get to dictate how much memory VM has.
> > > >
> > > > Aren't we talking bit per page? With 1TByte of memory to track ->
> > > > 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > >
> > > Ugh. Actually of course:
> > > With 1TByte of memory to track -> 256Mbit -> 32Mbit -> 8Mbyte per VF
> > >
> > > 8Gbyte per *PF* with 1K VFs.
> > >
> > Device may not maintain as a bitmap.
> 
> However you maintain it, there's 256Mega bit of information.
There may be other data structures that device may deploy as for example hash or tree or something else.
And this is runtime memory only during the short live migration period of 400msec or less.
It is not some _always_ resident memory.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16 16:26                                       ` [virtio-comment] " Parav Pandit
@ 2023-11-16 17:25                                         ` Michael S. Tsirkin
  2023-11-16 17:29                                           ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16 17:25 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 5:18 PM
> > 
> > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > We should expose a limit of the device in the proposed
> > > > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > > > > So that future provisioning framework can use it.
> > > > > >
> > > > > > I will cover this in v5 early next week.
> > > > >
> > > > > I do worry about how this can even work though. If you want a
> > > > > generic device you do not get to dictate how much memory VM has.
> > > > >
> > > > > Aren't we talking bit per page? With 1TByte of memory to track ->
> > > > > 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > >
> > > > Ugh. Actually of course:
> > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit -> 8Mbyte per VF
> > > >
> > > > 8Gbyte per *PF* with 1K VFs.
> > > >
> > > Device may not maintain as a bitmap.
> > 
> > However you maintain it, there's 256Mega bit of information.
> There may be other data structures that device may deploy as for example hash or tree or something else.

Point being?

> And this is runtime memory only during the short live migration period of 400msec or less.
> It is not some _always_ resident memory.

No - write tracking is used in the live phase of migration. It can be
enabled as long as you wish - it's a question of policy.  There actually
exist solutions that utilize this phase for redundancy, permanently
running in this mode.

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16 17:25                                         ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-16 17:29                                           ` Parav Pandit
  2023-11-16 18:20                                             ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-16 17:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 16, 2023 10:56 PM
> 
> On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 16, 2023 5:18 PM
> > >
> > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > We should expose a limit of the device in the proposed
> > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> track.
> > > > > > > So that future provisioning framework can use it.
> > > > > > >
> > > > > > > I will cover this in v5 early next week.
> > > > > >
> > > > > > I do worry about how this can even work though. If you want a
> > > > > > generic device you do not get to dictate how much memory VM has.
> > > > > >
> > > > > > Aren't we talking bit per page? With 1TByte of memory to track
> > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > >
> > > > > Ugh. Actually of course:
> > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit -> 8Mbyte
> > > > > per VF
> > > > >
> > > > > 8Gbyte per *PF* with 1K VFs.
> > > > >
> > > > Device may not maintain as a bitmap.
> > >
> > > However you maintain it, there's 256Mega bit of information.
> > There may be other data structures that device may deploy as for example
> hash or tree or something else.
> 
> Point being?
The device may have some hashing accelerator or other improvements that may perform better than bitmap as many queues in parallel attempt to update the shared database.
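
As a purely illustrative sketch of why a non-bitmap structure can win when
only a small fraction of pages is dirtied between query rounds (nothing below
comes from the proposal itself):

#include <stdint.h>

/* Open-addressed hash set of dirty page-frame numbers. Memory cost grows
 * with the number of distinct dirty pages recorded since the last
 * read-and-clear, not with total guest memory. PFN 0 and table overflow
 * are not handled in this sketch. */
#define SLOTS (1u << 16)
static uint64_t dirty_pfn[SLOTS];	/* 0 = empty slot */

static void record_dirty(uint64_t pfn)
{
	uint32_t h = (uint32_t)((pfn * 0x9E3779B97F4A7C15ULL) >> 48) % SLOTS;

	while (dirty_pfn[h] && dirty_pfn[h] != pfn)
		h = (h + 1) % SLOTS;	/* linear probing */
	dirty_pfn[h] = pfn;
}

static void read_and_clear(void (*report)(uint64_t pfn))
{
	for (uint32_t i = 0; i < SLOTS; i++) {
		if (dirty_pfn[i]) {
			report(dirty_pfn[i]);
			dirty_pfn[i] = 0;
		}
	}
}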

> 
> > And this is runtime memory only during the short live migration period of
> 400msec or less.
> > It is not some _always_ resident memory.
> 
> No - write tracking is used in the live phase of migration. It can be enabled as
> long as you wish - it's a question of policy.  There actually exist solutions that
> utilize this phase for redundancy, permanently running in this mode.

If such use case exists, one may further improve the device implementation.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16 17:29                                           ` [virtio-comment] " Parav Pandit
@ 2023-11-16 18:20                                             ` Michael S. Tsirkin
  2023-11-17  3:02                                               ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16 18:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 05:29:49PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 10:56 PM
> > 
> > On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 16, 2023 5:18 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > > We should expose a limit of the device in the proposed
> > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> > track.
> > > > > > > > So that future provisioning framework can use it.
> > > > > > > >
> > > > > > > > I will cover this in v5 early next week.
> > > > > > >
> > > > > > > I do worry about how this can even work though. If you want a
> > > > > > > generic device you do not get to dictate how much memory VM has.
> > > > > > >
> > > > > > > Aren't we talking bit per page? With 1TByte of memory to track
> > > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > >
> > > > > > Ugh. Actually of course:
> > > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit -> 8Mbyte
> > > > > > per VF
> > > > > >
> > > > > > 8Gbyte per *PF* with 1K VFs.
> > > > > >
> > > > > Device may not maintain as a bitmap.
> > > >
> > > > However you maintain it, there's 256Mega bit of information.
> > > There may be other data structures that device may deploy as for example
> > hash or tree or something else.
> > 
> > Point being?
> The device may have some hashing accelerator or other improvements that may perform better than bitmap as many queues in parallel attempt to update the shared database.

Maybe, I didn't give this thought.

My point was that to be able to keep all combinations of dirty/non dirty
page for each 4k page in a 1TByte guest device needs 8MBytes of
on-device memory per VF. As designed the query also has to report it for
each VF accurately even if multiple VFs are accessing same guest.

> > 
> > > And this is runtime memory only during the short live migration period of
> > 400msec or less.
> > > It is not some _always_ resident memory.
> > 
> > No - write tracking is used in the live phase of migration. It can be enabled as
> > long as you wish - it's a question of policy.  There actually exist solutions that
> > utilize this phase for redundancy, permanently running in this mode.
> 
> If such use case exists, one may further improve the device implementation.

Yes such use cases exist, there is no limit on how long migration takes.
So go ahead and further improve it please. Do not give us "we did not
get requests for this feature" please.

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16 18:20                                             ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-17  3:02                                               ` Parav Pandit
  2023-11-17  8:46                                                 ` [virtio-comment] " Michael S. Tsirkin
  2023-11-21  4:24                                                 ` Jason Wang
  0 siblings, 2 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-17  3:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 16, 2023 11:51 PM
> 
> On Thu, Nov 16, 2023 at 05:29:49PM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 16, 2023 10:56 PM
> > >
> > > On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 5:18 PM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > > > >
> > > > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > > > We should expose a limit of the device in the proposed
> > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> > > track.
> > > > > > > > > So that future provisioning framework can use it.
> > > > > > > > >
> > > > > > > > > I will cover this in v5 early next week.
> > > > > > > >
> > > > > > > > I do worry about how this can even work though. If you
> > > > > > > > want a generic device you do not get to dictate how much memory
> VM has.
> > > > > > > >
> > > > > > > > Aren't we talking bit per page? With 1TByte of memory to
> > > > > > > > track
> > > > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > >
> > > > > > > Ugh. Actually of course:
> > > > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit ->
> > > > > > > 8Mbyte per VF
> > > > > > >
> > > > > > > 8Gbyte per *PF* with 1K VFs.
> > > > > > >
> > > > > > Device may not maintain as a bitmap.
> > > > >
> > > > > However you maintain it, there's 256Mega bit of information.
> > > > There may be other data structures that device may deploy as for
> > > > example
> > > hash or tree or something else.
> > >
> > > Point being?
> > The device may have some hashing accelerator or other improvements that
> may perform better than bitmap as many queues in parallel attempt to update
> the shared database.
> 
> Maybe, I didn't give this thought.
> 
> My point was that to be able to keep all combinations of dirty/non dirty page
> for each 4k page in a 1TByte guest device needs 8MBytes of on-device memory
> per VF. As designed the query also has to report it for each VF accurately even if
> multiple VFs are accessing same guest.
Yes.

> 
> > >
> > > > And this is runtime memory only during the short live migration
> > > > period of
> > > 400msec or less.
> > > > It is not some _always_ resident memory.
> > >
> > > No - write tracking is used in the live phase of migration. It can
> > > be enabled as long as you wish - it's a question of policy.  There
> > > actually exist solutions that utilize this phase for redundancy, permanently
> running in this mode.
> >
> > If such use case exists, one may further improve the device implementation.
> 
> Yes such use cases exist, there is no limit on how long migration takes.
> So go ahead and further improve it please. Do not give us "we did not get
> requests for this feature" please.

Please describe the use case more precisely.
If any application or OS API for this exists, please point to it and explain where you would like this dirty page tracking to fit beyond device migration.
We may have to draw a line at a reasonable point rather than keep discussing indefinitely.
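
For readers following the sizing discussion above, a rough back-of-the-envelope sketch (my own illustration, not spec text): the cost of a plain one-bit-per-unit bitmap depends only on the tracked memory size and the tracking granularity, and the per-VF/per-PF figures quoted in this thread correspond to different granularity assumptions.

#include <stdint.h>
#include <stdio.h>

/* Rough sizing of a plain one-bit-per-unit write-record bitmap.
 * Assumptions (mine, not the spec's): 1 TiB of guest memory to track
 * and 1024 VFs per PF; the result scales linearly with both.
 */
int main(void)
{
    const uint64_t guest_mem = 1ULL << 40;            /* 1 TiB            */
    const uint64_t num_vfs   = 1024;
    const uint64_t gran[]    = { 4096, 16384 };       /* tracking unit    */

    for (int i = 0; i < 2; i++) {
        uint64_t units  = guest_mem / gran[i];
        uint64_t per_vf = units / 8;                  /* bits -> bytes    */
        uint64_t per_pf = per_vf * num_vfs;

        /* 4 KiB unit: 32 MiB per VF, 32 GiB per PF.
         * 16 KiB unit: 8 MiB per VF,  8 GiB per PF. */
        printf("%llu KiB unit: %llu MiB per VF, %llu GiB per PF\n",
               (unsigned long long)(gran[i] >> 10),
               (unsigned long long)(per_vf >> 20),
               (unsigned long long)(per_pf >> 30));
    }
    return 0;
}

A device that tracks at a coarser granularity, or that uses a different data structure, lands somewhere else on this curve, which is part of the disagreement above.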

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  3:02                                               ` [virtio-comment] " Parav Pandit
@ 2023-11-17  8:46                                                 ` Michael S. Tsirkin
  2023-11-17  9:14                                                   ` [virtio-comment] " Parav Pandit
  2023-11-21  4:24                                                 ` Jason Wang
  1 sibling, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17  8:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 03:02:20AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 11:51 PM
> > 
> > On Thu, Nov 16, 2023 at 05:29:49PM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 16, 2023 10:56 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 5:18 PM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > > > > >
> > > > > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > > > > We should expose a limit of the device in the proposed
> > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> > > > track.
> > > > > > > > > > So that future provisioning framework can use it.
> > > > > > > > > >
> > > > > > > > > > I will cover this in v5 early next week.
> > > > > > > > >
> > > > > > > > > I do worry about how this can even work though. If you
> > > > > > > > > want a generic device you do not get to dictate how much memory
> > VM has.
> > > > > > > > >
> > > > > > > > > Aren't we talking bit per page? With 1TByte of memory to
> > > > > > > > > track
> > > > > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > >
> > > > > > > > Ugh. Actually of course:
> > > > > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit ->
> > > > > > > > 8Mbyte per VF
> > > > > > > >
> > > > > > > > 8Gbyte per *PF* with 1K VFs.
> > > > > > > >
> > > > > > > Device may not maintain as a bitmap.
> > > > > >
> > > > > > However you maintain it, there's 256Mega bit of information.
> > > > > There may be other data structures that device may deploy as for
> > > > > example
> > > > hash or tree or something else.
> > > >
> > > > Point being?
> > > The device may have some hashing accelerator or other improvements that
> > may perform better than bitmap as many queues in parallel attempt to update
> > the shared database.
> > 
> > Maybe, I didn't give this thought.
> > 
> > My point was that to be able to keep all combinations of dirty/non dirty page
> > for each 4k page in a 1TByte guest device needs 8MBytes of on-device memory
> > per VF. As designed the query also has to report it for each VF accurately even if
> > multiple VFs are accessing same guest.
> Yes.
> 
> > 
> > > >
> > > > > And this is runtime memory only during the short live migration
> > > > > period of
> > > > 400msec or less.
> > > > > It is not some _always_ resident memory.
> > > >
> > > > No - write tracking is used in the live phase of migration. It can
> > > > be enabled as long as you wish - it's a question of policy.  There
> > > > actually exist solutions that utilize this phase for redundancy, permanently
> > running in this mode.
> > >
> > > If such use case exists, one may further improve the device implementation.
> > 
> > Yes such use cases exist, there is no limit on how long migration takes.
> > So go ahead and further improve it please. Do not give us "we did not get
> > requests for this feature" please.
> 
> Please describe the use case more precisely.  If there is any
> application or OS API etc exists, please point to it where would you
> like to fit this dirty page tracking beyond device migration.  We may
> have to draw a line to have reasonable point and not keep discussing
> infinitely.


What I had in mind was fault tolerance e.g. the abandoned Kemari
project.  Even just with KVM people tried several times so we know
there's interest.

In any case you can safely assume that many users will have migration
that takes seconds and minutes.

-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  8:46                                                 ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-17  9:14                                                   ` Parav Pandit
  2023-11-17  9:37                                                     ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17  9:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 2:16 PM
> In any case you can safely assume that many users will have migration that takes
> seconds and minutes.

Strange, but ok. I don't see any problem with the current method.
8MB is used for a very large 1TB VM whose migration takes minutes. Should be fine.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:14                                                   ` [virtio-comment] " Parav Pandit
@ 2023-11-17  9:37                                                     ` Michael S. Tsirkin
  2023-11-17  9:41                                                       ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17  9:37 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 2:16 PM
> > In any case you can safely assume that many users will have migration that takes
> > seconds and minutes.
> 
> Strange, but ok. I don't see any problem with current method.
> 8MB is used for very large VM of 1TB takes minutes. Should be fine.

The problem is simple: vendors selling devices have no idea how large
the VM will be. So you have to over-provision for the max VM size.
If there was a way to instead allocate that in host memory, that
would improve on this.

-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:37                                                     ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-17  9:41                                                       ` Parav Pandit
  2023-11-17  9:44                                                         ` Parav Pandit
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-17  9:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 3:08 PM
> 
> On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 2:16 PM In any case you can safely
> > > assume that many users will have migration that takes seconds and
> > > minutes.
> >
> > Strange, but ok. I don't see any problem with current method.
> > 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> 
> The problem is simple: vendors selling devices have no idea how large the VM
> will be. So you have to over-provision for the max VM size.
> If there was a way to instead allocate that in host memory, that would improve
> on this.

Not sure why one would over-provision for the max VM size.
The vendor does not know how many vcpus will be needed either; it is no different a problem.

When the VM migration is started, the individual tracking range is supplied by the hypervisor to the device.
The device allocates the necessary memory on this instruction.

When a VM of a certain size is provisioned, the member device can be provisioned for that VM size.
And if it cannot be provisioned, possibly this is not the right member device to use at that point in time.
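
To illustrate the allocation-on-demand point, a hypothetical device-side sketch (the structure and field names are mine and are not the command layout proposed in this series): the device sizes its tracking state from the ranges it is actually asked to track.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical write-record range as supplied by the hypervisor when
 * migration starts; illustrative only, not the proposed command format. */
struct track_range {
    uint64_t start;   /* guest physical start address */
    uint64_t length;  /* length in bytes */
};

/* The device sizes its tracking state from the supplied ranges, so an
 * idle member device holds no tracking memory at all. */
static size_t tracking_bytes(const struct track_range *r, size_t n,
                             uint64_t granularity)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        uint64_t units = (r[i].length + granularity - 1) / granularity;
        total += (units + 7) / 8;    /* one bit per tracked unit */
    }
    return total;
}

For a single 1 TiB range this reduces to the bitmap arithmetic discussed earlier in the thread; for smaller VMs or coarser granularity the device needs proportionally less.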

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:41                                                       ` [virtio-comment] " Parav Pandit
@ 2023-11-17  9:44                                                         ` Parav Pandit
  2023-11-17  9:51                                                         ` [virtio-comment] " Michael S. Tsirkin
  2023-11-17  9:52                                                         ` Zhu, Lingshan
  2 siblings, 0 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-17  9:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Parav Pandit <parav@nvidia.com>
> Sent: Friday, November 17, 2023 3:12 PM
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 3:08 PM
> >
> > On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 2:16 PM In any case you can safely
> > > > assume that many users will have migration that takes seconds and
> > > > minutes.
> > >
> > > Strange, but ok. I don't see any problem with current method.
> > > 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> >
> > The problem is simple: vendors selling devices have no idea how large
> > the VM will be. So you have to over-provision for the max VM size.
> > If there was a way to instead allocate that in host memory, that would
> > improve on this.
> 
> Not sure what to over provision for max VM size.
> Vendor does not know how many vcpus will be needed. It is no different
> problem.
> 
> When the VM migration is started, the individual tracking range is supplied by
> the hypervisor to device.
> Device allocates necessary memory on this instruction.
> 
> When the VM with certain size is provisioned, the member device can be
> provisioned for the VM size.
> And if it cannot be provisioned, possibly this may not the right member device
> to use at that point in time.

Clicked send a little early.
I agree that, if dynamic host memory is supplied by the owner device, an extension could be to have a request/event queue from the owner device, depending on the algorithm and the size of the VM it runs.
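
Purely as a hypothetical sketch of such an extension (none of these structures exist in this proposal or in the spec), the owner device could request host memory from the owner driver when write recording starts:

#include <stdint.h>

/* Hypothetical event the owner device places on a request/event queue
 * when it needs host memory to hold tracking state for a member device. */
struct track_mem_request {
    uint16_t vf_id;         /* member device the request is for      */
    uint64_t bytes_needed;  /* derived from the ranges being tracked */
};

/* Hypothetical reply from the owner driver granting host memory. */
struct track_mem_grant {
    uint16_t vf_id;
    uint64_t host_addr;     /* DMA address of the granted buffer     */
    uint64_t bytes;
};

Whether such an extension is worth specifying depends on the algorithm and the VM sizes involved, as noted above.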


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16 12:18                                                     ` Michael S. Tsirkin
@ 2023-11-17  9:50                                                       ` Zhu, Lingshan
  2023-11-17  9:55                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-17  9:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 11/16/2023 8:18 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 16, 2023 at 05:38:35PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/15/2023 7:52 PM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 15, 2023 at 04:42:56PM +0800, Zhu, Lingshan wrote:
>>>> On 11/15/2023 4:05 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Nov 15, 2023 at 03:59:16PM +0800, Zhu, Lingshan wrote:
>>>>>> On 11/15/2023 3:51 PM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
>>>>>>>> On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
>>>>>>>>>>>> So I can't
>>>>>>>>>>>> believe it has good performance overall. Logging via IOMMU or using
>>>>>>>>>>>> shadow virtqueue doesn't need any extra PCI transactions at least.
>>>>>>>>>>> On the other hand they have an extra CPU cost.  Personally if this is
>>>>>>>>>>> coming from a hardware vendor, I am inclined to trust them wrt PCI
>>>>>>>>>>> transactions.  But anyway, discussing this at a high level theoretically
>>>>>>>>>>> is pointless - whoever bothers with actual prototyping for performance
>>>>>>>>>>> testing wins, if no one does I'd expect a back of a napkin estimate
>>>>>>>>>>> to be included.
>>>>>>>>>> if so, Intel has released productions implementing these interfaces years
>>>>>>>>>> ago,
>>>>>>>>>> see live migration in 4.1. IFCVF vDPA Implementation,
>>>>>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
>>>>>>>>>> and
>>>>>>>>> That one is based on shadow queue, right? Which I think this shows
>>>>>>>>> is worth supporting.
>>>>>>>> Yes, it is shadow virtqueue, I assume this is already mostly done,
>>>>>>>> do you see any gaps we need to address in our series that we should work on?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>> There were a ton of comments posted on your series.
>>>>>> Hope I didn't miss anything. I see your latest comments are about vq states,
>>>>>> as replied before, I think we can record the states by two le16 and the
>>>>>> in-flight
>>>>>> descriptor tracking facility.
>>>>> I don't know why you need the le16. in-flight tracking should be enough.
>>>>> And given it needs DMA I would try really hard to actually use
>>>>> admin commands for this.
>>>> we need to record the on-device avail_idx and used_idx, or
>>>> how can the destination side know the device internal values.
>>> Again you never documented what state the device is in so I can't really
>>> say for sure.  But generally whenever a buffer is used the internal
>>> values are written out to memory.
>> This is the state of a virtqueue, in my series I have defined what is
>> vq state in [PATCH V2 1/6] virtio: introduce virtqueue state,
>> and give an example of PCI implementation.
>>
>> In short, for split vq it is last_avail_idx and in-flight descriptors.
>>
>> I humbly request an explicit list of missing gaps, so that I can improve my
>> V3
>>
>> Thanks
> I don't know how to help you without resorting to writing it instead of
> you, I sent 3 messages in response to that one patch alone. Your patch
> just adds some bits here and there without much in the way of
> documentation. Patch needs to explain what these things are and how do they
> interact with VQ state in memory.
Please allow me to quote the series:
+The available state field is two bytes of virtqueue state that is used by
+the device to read the next available buffer. It is presented in the 
following format:
+
+\begin{lstlisting}
+le16 last_avail_idx;
+\end{lstlisting}
+
+The \field{last_avail_idx} field is the free-running available ring
+index where the device will read the next available head of a
+descriptor chain.

+When SUSPEND is set, the device MUST record the Available State of 
every enabled splited virtqueue
+in \field{Available State} field,
+and correspondingly restore the Available State of every enabled 
splited virtqueue
+from \field{Available State} field when DRIVER_OK is set.
+
+The device SHOULD reset \field{Available State} field upon a device reset

I will add these contents in the next series:
1) vq states refer to the device internal states, so we have to record and
restore them
2) in-flight descriptors tracking.

Not sure whether I should describe how the in-guest-memory vq configuration is
migrated, because it is migrated along with the guest memory.
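
As a rough illustration (field names are mine, not the proposed spec text), the device-internal split-virtqueue state that needs saving on SUSPEND and restoring before DRIVER_OK amounts to something like:

#include <stdint.h>

/* Illustrative only: device-internal state of one split virtqueue. */
struct split_vq_state {
    uint16_t last_avail_idx;    /* next avail ring entry the device reads */
    uint16_t num_in_flight;     /* descriptors fetched but not yet used   */
    uint16_t in_flight_heads[]; /* head ids of the in-flight chains       */
};

The descriptor table and the avail/used rings themselves live in guest memory and migrate with it, which matches the note above about not describing in-guest-memory configuration.
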
>
>
> But besides, Parav needs to do exactly the same too. So why don't you
> let Parav do the work on this and then later just add a small interface
> to send admin commands through VF itself? Looks like this will be good
I think this is the topic: shall we process admin cmds in config space.....
> enough for VDPA. Meanwhile I feel your energy would be better spend
> working on transport vq which no one else is working on.
I can start rework on transport vq next week.

Thanks
>
>
>
>
>>>>>> For this shadow virtqueue, do you think I should address this in my V4?
>>>>>> Like saying: acknowledged control commands through the control virtqueue
>>>>>> should be recorded, and we want to use shadow virtqueue to track dirty
>>>>>> pages.
>>>>> What you need to do is actually describe what do you expect the device
>>>>> to do when it enters this suspend state. since you mention control
>>>>> virtqueue then it seems that there needs to be a device type
>>>>> specific text explaining the behaviour. If so this implies there
>>>>> needs to be a list of device types that support suspend
>>>>> until someone looks at each type and documents what it does.
>>>> On a second thought, shadow vqs are hypervisor behaviors, maybe should not
>>>> be
>>>> described in this device spec.
>>>>
>>>> Since SUSPEND is in device status, so for now I see every type of device
>>>> implements
>>>> device_status should support SUSPEND. This should be a general facility.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:41                                                       ` [virtio-comment] " Parav Pandit
  2023-11-17  9:44                                                         ` Parav Pandit
@ 2023-11-17  9:51                                                         ` Michael S. Tsirkin
  2023-11-17  9:54                                                           ` Zhu, Lingshan
  2023-11-17  9:57                                                           ` Parav Pandit
  2023-11-17  9:52                                                         ` Zhu, Lingshan
  2 siblings, 2 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17  9:51 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 09:41:40AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 3:08 PM
> > 
> > On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 2:16 PM In any case you can safely
> > > > assume that many users will have migration that takes seconds and
> > > > minutes.
> > >
> > > Strange, but ok. I don't see any problem with current method.
> > > 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> > 
> > The problem is simple: vendors selling devices have no idea how large the VM
> > will be. So you have to over-provision for the max VM size.
> > If there was a way to instead allocate that in host memory, that would improve
> > on this.
> 
> Not sure what to over provision for max VM size.
> Vendor does not know how many vcpus will be needed. It is no different problem.
> 
> When the VM migration is started, the individual tracking range is supplied by the hypervisor to device.
> Device allocates necessary memory on this instruction.
> 
> When the VM with certain size is provisioned, the member device can be provisioned for the VM size.
> And if it cannot be provisioned, possibly this may not the right member device to use at that point in time.

For someone who keeps arguing against adding single bit registers
"because it does not scale" you seem very nonchalant about adding
8Mbytes.

I thought we have a nicely contained and orthogonal feature, so if it's
optional it's not a problem.

But with such costs and corner cases what exactly is the motivation for
the feature here?  Do you have a PoC showing how this works better than
e.g. shadow VQ?

Maybe IOMMU based and shadow VQ based tracking are the way to go
initially, and if there's a problem then we should add this later, on
top.

I really want us to finally make progress merging features and anything
that reduces scope initially is good for that.

-- 
MST
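
For comparison, a very simplified sketch of what shadow-virtqueue-style tracking looks like on the hypervisor side (an illustration of the general idea only, not of any particular implementation):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define MAX_GFNS   (1u << 20)                 /* demo-sized guest          */

static uint8_t dirty_bitmap[MAX_GFNS / 8];    /* hypervisor-side bitmap    */

static void set_dirty_bit(uint64_t gfn)
{
    dirty_bitmap[gfn / 8] |= 1u << (gfn % 8);
}

/* Called for each used descriptor the device completes in the shadow ring:
 * every device-writable buffer passes through the hypervisor, so it can
 * mark the touched guest pages dirty without any on-device tracking memory. */
static void relay_used_descriptor(uint64_t guest_addr, uint32_t len,
                                  bool device_writable)
{
    uint64_t gfn, first, last;

    if (!device_writable || len == 0)
        return;                               /* nothing written by device */

    first = guest_addr >> PAGE_SHIFT;
    last  = (guest_addr + len - 1) >> PAGE_SHIFT;
    for (gfn = first; gfn <= last; gfn++)
        set_dirty_bit(gfn);

    /* ...then forward the used entry to the guest-visible used ring. */
}

The trade-off being debated in this thread is hypervisor CPU cost for this relaying versus on-device memory for write recording.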


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:41                                                       ` [virtio-comment] " Parav Pandit
  2023-11-17  9:44                                                         ` Parav Pandit
  2023-11-17  9:51                                                         ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-17  9:52                                                         ` Zhu, Lingshan
  2023-11-17  9:59                                                           ` [virtio-comment] " Parav Pandit
  2 siblings, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-17  9:52 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/17/2023 5:41 PM, Parav Pandit wrote:
>
>> From: Michael S. Tsirkin <mst@redhat.com>
>> Sent: Friday, November 17, 2023 3:08 PM
>>
>> On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
>>>
>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>> Sent: Friday, November 17, 2023 2:16 PM In any case you can safely
>>>> assume that many users will have migration that takes seconds and
>>>> minutes.
>>> Strange, but ok. I don't see any problem with current method.
>>> 8MB is used for very large VM of 1TB takes minutes. Should be fine.
>> The problem is simple: vendors selling devices have no idea how large the VM
>> will be. So you have to over-provision for the max VM size.
>> If there was a way to instead allocate that in host memory, that would improve
>> on this.
> Not sure what to over provision for max VM size.
> Vendor does not know how many vcpus will be needed. It is no different problem.
>
> When the VM migration is started, the individual tracking range is supplied by the hypervisor to device.
> Device allocates necessary memory on this instruction.
>
> When the VM with certain size is provisioned, the member device can be provisioned for the VM size.
> And if it cannot be provisioned, possibly this may not the right member device to use at that point in time.
I think Michael means the guest memory can be large, and the device may
DMA anywhere, so the device should prepare for the worst case, which could
be a full u64 range and therefore over-provisioned.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:51                                                         ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-17  9:54                                                           ` Zhu, Lingshan
  2023-11-17 10:02                                                             ` Michael S. Tsirkin
  2023-11-17  9:57                                                           ` Parav Pandit
  1 sibling, 1 reply; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-17  9:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/17/2023 5:51 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 17, 2023 at 09:41:40AM +0000, Parav Pandit wrote:
>>
>>> From: Michael S. Tsirkin <mst@redhat.com>
>>> Sent: Friday, November 17, 2023 3:08 PM
>>>
>>> On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
>>>>
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Sent: Friday, November 17, 2023 2:16 PM In any case you can safely
>>>>> assume that many users will have migration that takes seconds and
>>>>> minutes.
>>>> Strange, but ok. I don't see any problem with current method.
>>>> 8MB is used for very large VM of 1TB takes minutes. Should be fine.
>>> The problem is simple: vendors selling devices have no idea how large the VM
>>> will be. So you have to over-provision for the max VM size.
>>> If there was a way to instead allocate that in host memory, that would improve
>>> on this.
>> Not sure what to over provision for max VM size.
>> Vendor does not know how many vcpus will be needed. It is no different problem.
>>
>> When the VM migration is started, the individual tracking range is supplied by the hypervisor to device.
>> Device allocates necessary memory on this instruction.
>>
>> When the VM with certain size is provisioned, the member device can be provisioned for the VM size.
>> And if it cannot be provisioned, possibly this may not the right member device to use at that point in time.
> For someone who keeps arguing against adding single bit registers
> "because it does not scale" you seem very nonchalant about adding
> 8Mbytes.
>
> I thought we have a nicely contained and orthogonal feature, so if it's
> optional it's not a problem.
>
> But with such costs and corner cases what exactly is the motivation for
> the feature here?  Do you have a PoC showing how this works better than
> e.g. shadow VQ?
>
> Maybe IOMMU based and shadow VQ based tracking are the way to go
> initially, and if there's a problem then we should add this later, on
> top.
I agree.
>
> I really want us to finally make progress merging features and anything
> that reduces scope initially is good for that.
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:50                                                       ` Zhu, Lingshan
@ 2023-11-17  9:55                                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17  9:55 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 05:50:24PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/16/2023 8:18 PM, Michael S. Tsirkin wrote:
> > On Thu, Nov 16, 2023 at 05:38:35PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/15/2023 7:52 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Nov 15, 2023 at 04:42:56PM +0800, Zhu, Lingshan wrote:
> > > > > On 11/15/2023 4:05 PM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Nov 15, 2023 at 03:59:16PM +0800, Zhu, Lingshan wrote:
> > > > > > > On 11/15/2023 3:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Nov 15, 2023 at 12:05:59PM +0800, Zhu, Lingshan wrote:
> > > > > > > > > On 11/14/2023 4:27 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Tue, Nov 14, 2023 at 03:34:32PM +0800, Zhu, Lingshan wrote:
> > > > > > > > > > > > > So I can't
> > > > > > > > > > > > > believe it has good performance overall. Logging via IOMMU or using
> > > > > > > > > > > > > shadow virtqueue doesn't need any extra PCI transactions at least.
> > > > > > > > > > > > On the other hand they have an extra CPU cost.  Personally if this is
> > > > > > > > > > > > coming from a hardware vendor, I am inclined to trust them wrt PCI
> > > > > > > > > > > > transactions.  But anyway, discussing this at a high level theoretically
> > > > > > > > > > > > is pointless - whoever bothers with actual prototyping for performance
> > > > > > > > > > > > testing wins, if no one does I'd expect a back of a napkin estimate
> > > > > > > > > > > > to be included.
> > > > > > > > > > > if so, Intel has released productions implementing these interfaces years
> > > > > > > > > > > ago,
> > > > > > > > > > > see live migration in 4.1. IFCVF vDPA Implementation,
> > > > > > > > > > > https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> > > > > > > > > > > and
> > > > > > > > > > That one is based on shadow queue, right? Which I think this shows
> > > > > > > > > > is worth supporting.
> > > > > > > > > Yes, it is shadow virtqueue, I assume this is already mostly done,
> > > > > > > > > do you see any gaps we need to address in our series that we should work on?
> > > > > > > > > 
> > > > > > > > > Thanks
> > > > > > > > There were a ton of comments posted on your series.
> > > > > > > Hope I didn't miss anything. I see your latest comments are about vq states,
> > > > > > > as replied before, I think we can record the states by two le16 and the
> > > > > > > in-flight
> > > > > > > descriptor tracking facility.
> > > > > > I don't know why you need the le16. in-flight tracking should be enough.
> > > > > > And given it needs DMA I would try really hard to actually use
> > > > > > admin commands for this.
> > > > > we need to record the on-device avail_idx and used_idx, or
> > > > > how can the destination side know the device internal values.
> > > > Again you never documented what state the device is in so I can't really
> > > > say for sure.  But generally whenever a buffer is used the internal
> > > > values are written out to memory.
> > > This is the state of a virtqueue, in my series I have defined what is
> > > vq state in [PATCH V2 1/6] virtio: introduce virtqueue state,
> > > and give an example of PCI implementation.
> > > 
> > > In short, for split vq it is last_avail_idx and in-flight descriptors.
> > > 
> > > I humbly request an explicit list of missing gaps, so that I can improve my
> > > V3
> > > 
> > > Thanks
> > I don't know how to help you without resorting to writing it instead of
> > you, I sent 3 messages in response to that one patch alone. Your patch
> > just adds some bits here and there without much in the way of
> > documentation. Patch needs to explain what these things are and how do they
> > interact with VQ state in memory.
> Please allow me quote the series:
> +The available state field is two bytes of virtqueue state that is used by
> +the device to read the next available buffer. It is presented in the
> following format:
> +
> +\begin{lstlisting}
> +le16 last_avail_idx;
> +\end{lstlisting}
> +
> +The \field{last_avail_idx} field is the free-running available ring
> +index where the device will read the next available head of a
> +descriptor chain.

Next *after what*? The last used buffer? This is exactly the used index.

> 
> +When SUSPEND is set, the device MUST record the Available State of every
> enabled splited virtqueue
> +in \field{Available State} field,
> +and correspondingly restore the Available State of every enabled splited
> virtqueue
> +from \field{Available State} field when DRIVER_OK is set.
> +
> +The device SHOULD reset \field{Available State} field upon a device reset
> 
> I will add these contents in next series:
> 1) vq states refer to the device internal states, so have to record and
> restore them
> 2) in-flight descriprots tracking.
> 
> Not sure whether I should describe how in-guest-memory vq config migrated,
> because they migrated with guest memory.
> > 
> > 
> > But besides, Parav needs to do exactly the same too. So why don't you
> > let Parav do the work on this and then later just add a small interface
> > to send admin commands through VF itself? Looks like this will be good
> I think this is the topic: shall we process admin cmds in config space.....
> > enough for VDPA. Meanwhile I feel your energy would be better spend
> > working on transport vq which no one else is working on.
> I can start rework on transport vq next week.
> 
> Thanks
> > 
> > 
> > 
> > 
> > > > > > > For this shadow virtqueue, do you think I should address this in my V4?
> > > > > > > Like saying: acknowledged control commands through the control virtqueue
> > > > > > > should be recorded, and we want to use shadow virtqueue to track dirty
> > > > > > > pages.
> > > > > > What you need to do is actually describe what do you expect the device
> > > > > > to do when it enters this suspend state. since you mention control
> > > > > > virtqueue then it seems that there needs to be a device type
> > > > > > specific text explaining the behaviour. If so this implies there
> > > > > > needs to be a list of device types that support suspend
> > > > > > until someone looks at each type and documents what it does.
> > > > > On a second thought, shadow vqs are hypervisor behaviors, maybe should not
> > > > > be
> > > > > described in this device spec.
> > > > > 
> > > > > Since SUSPEND is in device status, so for now I see every type of device
> > > > > implements
> > > > > device_status should support SUSPEND. This should be a general facility.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:51                                                         ` [virtio-comment] " Michael S. Tsirkin
  2023-11-17  9:54                                                           ` Zhu, Lingshan
@ 2023-11-17  9:57                                                           ` Parav Pandit
  2023-11-17 10:37                                                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17  9:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Friday, November 17, 2023 3:21 PM
> 
> On Fri, Nov 17, 2023 at 09:41:40AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 3:08 PM
> > >
> > > On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 2:16 PM In any case you can
> > > > > safely assume that many users will have migration that takes
> > > > > seconds and minutes.
> > > >
> > > > Strange, but ok. I don't see any problem with current method.
> > > > 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> > >
> > > The problem is simple: vendors selling devices have no idea how
> > > large the VM will be. So you have to over-provision for the max VM size.
> > > If there was a way to instead allocate that in host memory, that
> > > would improve on this.
> >
> > Not sure what to over provision for max VM size.
> > Vendor does not know how many vcpus will be needed. It is no different
> problem.
> >
> > When the VM migration is started, the individual tracking range is supplied by
> the hypervisor to device.
> > Device allocates necessary memory on this instruction.
> >
> > When the VM with certain size is provisioned, the member device can be
> provisioned for the VM size.
> > And if it cannot be provisioned, possibly this may not the right member device
> to use at that point in time.
> 
> For someone who keeps arguing against adding single bit registers "because it
> does not scale" you seem very nonchalant about adding 8Mbytes.
> 
There is a fundamental difference in how/when a bit is used.
One case wants to use a bit for a non-performance part and keep it always resident, versus the data path here.
It is not the same comparison.

> I thought we have a nicely contained and orthogonal feature, so if it's optional
> it's not a problem.
It is optional as always.

> 
> But with such costs and corner cases what exactly is the motivation for the
> feature here?  
New-generation DPUs have memory for device data path workloads, but not for bits.

> Do you have a PoC showing how this works better than e.g.
> shadow VQ?
> 
Not yet.
But I don't think this can even be a criterion to consider, as the dependency on PASID is a nonstarter, along with other limitations.

> Maybe IOMMU based and shadow VQ based tracking are the way to go initially,
> and if there's a problem then we should add this later, on top.
>
CPUs that do not support an IOMMU cannot shift to a shadow VQ either.
 
> I really want us to finally make progress merging features and anything that
> reduces scope initially is good for that.
>
Yes, if you prefer to split the last three patches, I am fine.
Please let me know.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:52                                                         ` Zhu, Lingshan
@ 2023-11-17  9:59                                                           ` Parav Pandit
  2023-11-17 10:00                                                             ` [virtio-comment] " Zhu, Lingshan
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17  9:59 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 17, 2023 3:23 PM
> 
> 
> On 11/17/2023 5:41 PM, Parav Pandit wrote:
> >
> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> Sent: Friday, November 17, 2023 3:08 PM
> >>
> >> On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> >>>
> >>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>> Sent: Friday, November 17, 2023 2:16 PM In any case you can safely
> >>>> assume that many users will have migration that takes seconds and
> >>>> minutes.
> >>> Strange, but ok. I don't see any problem with current method.
> >>> 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> >> The problem is simple: vendors selling devices have no idea how large
> >> the VM will be. So you have to over-provision for the max VM size.
> >> If there was a way to instead allocate that in host memory, that
> >> would improve on this.
> > Not sure what to over provision for max VM size.
> > Vendor does not know how many vcpus will be needed. It is no different
> problem.
> >
> > When the VM migration is started, the individual tracking range is supplied by
> the hypervisor to device.
> > Device allocates necessary memory on this instruction.
> >
> > When the VM with certain size is provisioned, the member device can be
> provisioned for the VM size.
> > And if it cannot be provisioned, possibly this may not the right member device
> to use at that point in time.
> I think Michael means the guest memory can be large, and the device may DMA
> anywhere, so the device should prepare for the worst case, that could be U64
> size which can be over-provision.

No, that is not true.
The hypervisor supplies the range of addresses on which to track the dirty pages.
So for sure it is not u64.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16 11:59                                   ` Michael S. Tsirkin
@ 2023-11-17  9:59                                     ` Zhu, Lingshan
  2023-11-17 10:03                                       ` Parav Pandit
  2023-11-17 10:40                                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-17  9:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
>>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
>>>> We should expose a limit of the device in the proposed WRITE_RECORD_CAP_QUERY command, that how much range it can track.
>>>> So that future provisioning framework can use it.
>>>>
>>>> I will cover this in v5 early next week.
>>> I do worry about how this can even work though. If you want a generic
>>> device you do not get to dictate how much memory VM has.
>>>
>>> Aren't we talking bit per page? With 1TByte of memory to track ->
>>> 256Gbit -> 32Gbit -> 8Gbyte per VF?
>>>
>>> And you happily say "we'll address this in the future" while at the same
>>> time fighting tooth and nail against adding single bit status registers
>>> because scalability?
>>>
>>>
>>> I have a feeling doing this completely theoretical like this is problematic.
>>> Maybe you have it all laid out neatly in your head but I suspect
>>> not all of TC can picture it clearly enough based just on spec text.
>>>
>>> We do sometimes ask for POC implementation in linux / qemu to
>>> demonstrate how things work before merging code. We skipped this
>>> for admin things so far but I think it's a good idea to start doing
>>> it here.
>>>
>>> What makes me pause a bit before saying please do a PoC is
>>> all the opposition that seems to exist to even using admin
>>> commands in the 1st place. I think once we finally stop
>>> arguing about whether to use admin commands at all then
>>> a PoC will be needed before merging.
>> We have POR productions that implemented the approach in my series. They are
>> multiple generations
>> of productions in market and running in customers data centers for years.
>>
>> Back to 2019 when we start working on vDPA, we have sent some samples of
>> production(e.g., Cascade Glacier)
>> and the datasheet, you can find live migration facilities there, includes
>> suspend, vq state and other
>> features.
>>
>> And there is an reference in DPDK live migration, I have provided this page
>> before:
>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it has been working for
>> long long time.
>>
>> So if we let the facts speak, if we want to see if the proposal is proven to
>> work, I would
>> say: They are POR for years, customers already deployed them for years.
> And I guess what you are trying to say is that this patchset
> we are reviewing here should be help to the same standard and
> there should be a PoC? Sounds reasonable.
Yes, and the in-market products are POR; the series just improves the design.
For example, our series also uses registers to track vq state, but with
improvements over CG or BSC. So I think they are proven to work.
>
>> For dirty page tracking, I see you want both platform IOMMU tracking and
>> shadow vqs, I am
>> totally fine with this idea. And I think maybe we should merge the basic
>> features first, and
>> dirty page tracking should be the second step.
>>
>> Thanks
> Parav wants to add an option of on-device tracking. Which also seems
> fine. I think it should be optional though because shadow and IOMMU
> options exist.
I agree, the vendor can choose to implement their own facility as a backup.
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:59                                                           ` [virtio-comment] " Parav Pandit
@ 2023-11-17 10:00                                                             ` Zhu, Lingshan
  0 siblings, 0 replies; 157+ messages in thread
From: Zhu, Lingshan @ 2023-11-17 10:00 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/17/2023 5:59 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, November 17, 2023 3:23 PM
>>
>>
>> On 11/17/2023 5:41 PM, Parav Pandit wrote:
>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>> Sent: Friday, November 17, 2023 3:08 PM
>>>>
>>>> On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>> Sent: Friday, November 17, 2023 2:16 PM In any case you can safely
>>>>>> assume that many users will have migration that takes seconds and
>>>>>> minutes.
>>>>> Strange, but ok. I don't see any problem with current method.
>>>>> 8MB is used for very large VM of 1TB takes minutes. Should be fine.
>>>> The problem is simple: vendors selling devices have no idea how large
>>>> the VM will be. So you have to over-provision for the max VM size.
>>>> If there was a way to instead allocate that in host memory, that
>>>> would improve on this.
>>> Not sure what to over provision for max VM size.
>>> Vendor does not know how many vcpus will be needed. It is no different
>> problem.
>>> When the VM migration is started, the individual tracking range is supplied by
>> the hypervisor to device.
>>> Device allocates necessary memory on this instruction.
>>>
>>> When the VM with certain size is provisioned, the member device can be
>> provisioned for the VM size.
>>> And if it cannot be provisioned, possibly this may not the right member device
>> to use at that point in time.
>> I think Michael means the guest memory can be large, and the device may DMA
>> anywhere, so the device should prepare for the worst case, that could be U64
>> size which can be over-provision.
> No. that is not true.
> The hypervisor supplies the range of addresses on which to track the dirty pages.
> So for sure it is not u64.
The hypervisor provides GPAs to the guest, but the VA can be a very high
address, which means it can be u64.
>
>



* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:54                                                           ` Zhu, Lingshan
@ 2023-11-17 10:02                                                             ` Michael S. Tsirkin
  2023-11-17 10:10                                                               ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 10:02 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 05:54:32PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/17/2023 5:51 PM, Michael S. Tsirkin wrote:
> > On Fri, Nov 17, 2023 at 09:41:40AM +0000, Parav Pandit wrote:
> > > 
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 3:08 PM
> > > > 
> > > > On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> > > > > 
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 2:16 PM In any case you can safely
> > > > > > assume that many users will have migration that takes seconds and
> > > > > > minutes.
> > > > > Strange, but ok. I don't see any problem with current method.
> > > > > 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> > > > The problem is simple: vendors selling devices have no idea how large the VM
> > > > will be. So you have to over-provision for the max VM size.
> > > > If there was a way to instead allocate that in host memory, that would improve
> > > > on this.
> > > Not sure what to over provision for max VM size.
> > > Vendor does not know how many vcpus will be needed. It is no different problem.
> > > 
> > > When the VM migration is started, the individual tracking range is supplied by the hypervisor to device.
> > > Device allocates necessary memory on this instruction.
> > > 
> > > When the VM with certain size is provisioned, the member device can be provisioned for the VM size.
> > > And if it cannot be provisioned, possibly this may not the right member device to use at that point in time.
> > For someone who keeps arguing against adding single bit registers
> > "because it does not scale" you seem very nonchalant about adding
> > 8Mbytes.
> > 
> > I thought we have a nicely contained and orthogonal feature, so if it's
> > optional it's not a problem.
> > 
> > But with such costs and corner cases what exactly is the motivation for
> > the feature here?  Do you have a PoC showing how this works better than
> > e.g. shadow VQ?
> > 
> > Maybe IOMMU based and shadow VQ based tracking are the way to go
> > initially, and if there's a problem then we should add this later, on
> > top.
> I agree.

However, the patchset is ordered sensibly: first the device state
recording, then write tracking. So we can merge patches 1-5 and defer
6-8 if we want to.

Parav, I suggest splitting write tracking into a separate patchset just
because it seems so contentious.

I notice there have not been comments on 1-5 yet; I am not sure why
I started with patch 6 - I guess I was curious what it does.
I'll focus review on 1-5 next week.




> > 
> > I really want us to finally make progress merging features and anything
> > that reduces scope initially is good for that.
> > 





* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:59                                     ` Zhu, Lingshan
@ 2023-11-17 10:03                                       ` Parav Pandit
  2023-11-17 11:00                                         ` Michael S. Tsirkin
  2023-11-17 10:40                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 10:03 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 17, 2023 3:30 PM
> 
> On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> >>
> >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> >>>> We should expose a limit of the device in the proposed
> WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> >>>> So that future provisioning framework can use it.
> >>>>
> >>>> I will cover this in v5 early next week.
> >>> I do worry about how this can even work though. If you want a
> >>> generic device you do not get to dictate how much memory VM has.
> >>>
> >>> Aren't we talking bit per page? With 1TByte of memory to track ->
> >>> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> >>>
> >>> And you happily say "we'll address this in the future" while at the
> >>> same time fighting tooth and nail against adding single bit status
> >>> registers because scalability?
> >>>
> >>>
> >>> I have a feeling doing this completely theoretical like this is problematic.
> >>> Maybe you have it all laid out neatly in your head but I suspect not
> >>> all of TC can picture it clearly enough based just on spec text.
> >>>
> >>> We do sometimes ask for POC implementation in linux / qemu to
> >>> demonstrate how things work before merging code. We skipped this for
> >>> admin things so far but I think it's a good idea to start doing it
> >>> here.
> >>>
> >>> What makes me pause a bit before saying please do a PoC is all the
> >>> opposition that seems to exist to even using admin commands in the
> >>> 1st place. I think once we finally stop arguing about whether to use
> >>> admin commands at all then a PoC will be needed before merging.
> >> We have POR productions that implemented the approach in my series.
> >> They are multiple generations of productions in market and running in
> >> customers data centers for years.
> >>
> >> Back to 2019 when we start working on vDPA, we have sent some samples
> >> of production(e.g., Cascade Glacier) and the datasheet, you can find
> >> live migration facilities there, includes suspend, vq state and other
> >> features.
> >>
> >> And there is an reference in DPDK live migration, I have provided
> >> this page
> >> before:
> >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it has been
> >> working for long long time.
> >>
> >> So if we let the facts speak, if we want to see if the proposal is
> >> proven to work, I would
> >> say: They are POR for years, customers already deployed them for years.
> > And I guess what you are trying to say is that this patchset we are
> > reviewing here should be help to the same standard and there should be
> > a PoC? Sounds reasonable.
> Yes and the in-marketing productions are POR, the series just improves the
> design, for example, our series also use registers to track vq state, but
> improvements than CG or BSC. So I think they are proven to work.

If you prefer to go the route of POR, production, proven documentation and so on, there are plenty of products of multiple types I can list here, with open-source code, documentation and more.
Let me know what you would like to see.

Michael has requested some performance comparisons; not all are ready to share yet.
Some are, and I will share them in the coming weeks.

Also, the vdpa DPDK driver you referenced did not have basic CVQ support when I last looked at it.
Do you know when it was added?

> >
> >> For dirty page tracking, I see you want both platform IOMMU tracking
> >> and shadow vqs, I am totally fine with this idea. And I think maybe
> >> we should merge the basic features first, and dirty page tracking
> >> should be the second step.
> >>
> >> Thanks
> > Parav wants to add an option of on-device tracking. Which also seems
> > fine. I think it should be optional though because shadow and IOMMU
> > options exist.
> I agree, the vendor can choose to implement their own facility as a backup.
> >



* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 10:02                                                             ` Michael S. Tsirkin
@ 2023-11-17 10:10                                                               ` Parav Pandit
  0 siblings, 0 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 10:10 UTC (permalink / raw)
  To: Michael S. Tsirkin, Zhu, Lingshan
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Friday, November 17, 2023 3:32 PM
> 
> On Fri, Nov 17, 2023 at 05:54:32PM +0800, Zhu, Lingshan wrote:
> >
> >
> > On 11/17/2023 5:51 PM, Michael S. Tsirkin wrote:
> > > On Fri, Nov 17, 2023 at 09:41:40AM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 3:08 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 2:16 PM In any case you can
> > > > > > > safely assume that many users will have migration that takes
> > > > > > > seconds and minutes.
> > > > > > Strange, but ok. I don't see any problem with current method.
> > > > > > 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> > > > > The problem is simple: vendors selling devices have no idea how
> > > > > large the VM will be. So you have to over-provision for the max VM size.
> > > > > If there was a way to instead allocate that in host memory, that
> > > > > would improve on this.
> > > > Not sure what to over provision for max VM size.
> > > > Vendor does not know how many vcpus will be needed. It is no different
> problem.
> > > >
> > > > When the VM migration is started, the individual tracking range is supplied
> by the hypervisor to device.
> > > > Device allocates necessary memory on this instruction.
> > > >
> > > > When the VM with certain size is provisioned, the member device can be
> provisioned for the VM size.
> > > > And if it cannot be provisioned, possibly this may not the right member
> device to use at that point in time.
> > > For someone who keeps arguing against adding single bit registers
> > > "because it does not scale" you seem very nonchalant about adding
> > > 8Mbytes.
> > >
> > > I thought we have a nicely contained and orthogonal feature, so if
> > > it's optional it's not a problem.
> > >
> > > But with such costs and corner cases what exactly is the motivation
> > > for the feature here?  Do you have a PoC showing how this works
> > > better than e.g. shadow VQ?
> > >
> > > Maybe IOMMU based and shadow VQ based tracking are the way to go
> > > initially, and if there's a problem then we should add this later,
> > > on top.
> > I agree.
> 
> However, the patchset is ordered sensibly, first the device state recording and
> then write tracking. So we can merge patches 1-5 and defer
> 6-8 if we want to.
> 
> Parav I suggest maybe split write tracking to a separate patchset just because it
> seems so contentious.
> 
> I notice there have not been comments on 1-5 yet, I am not sure why I started
> with patch 6 - I guess I was curious what it does.
> I'll focus review on 1-5 next week.

Ok, sounds good.




* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-15 17:42                                 ` [virtio-comment] " Parav Pandit
  2023-11-16  4:18                                   ` [virtio-comment] " Jason Wang
@ 2023-11-17 10:15                                   ` Michael S. Tsirkin
  2023-11-17 10:48                                     ` Parav Pandit
  1 sibling, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 10:15 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 15, 2023 at 05:42:04PM +0000, Parav Pandit wrote:
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:02 AM
> > 
> > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > > >
> > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > > > > >
> > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > Each virtio and non virtio devices who wants to report
> > > > > > > > > > > their dirty page report,
> > > > > > > > > > will do their way.
> > > > > > > > > > >
> > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > deprecated in the future for sure, as platform will
> > > > > > > > > > > > provide much rich features for logging e.g it can do
> > > > > > > > > > > > it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > > to compete with the features that will be provided
> > > > > > > > > > > > by the platform
> > > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > > virtio tc with timelines
> > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > >
> > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > Virtio needs to be built on top of transport or platform. There's
> > no need to duplicate their job.
> > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > >
> > > > > > > > > I wanted to see a strong commitment for the cpu vendors to
> > support dirty page tracking.
> > > > > > > >
> > > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel,
> > > > > > > > AMD and ARM are all supporting that now.
> > > > > > > >
> > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > >
> > > > > > > > Let me quote from the above link:
> > > > > > > >
> > > > > > > > """
> > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > """
> > > > > > > >
> > > > > > > > > Without such platform commitment, virtio also skipping it would
> > not work.
> > > > > > > >
> > > > > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > > > > vtd, the hw feature has been there for years.
> > > > > > >
> > > > > > >
> > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > >
> > > > > > I think this comment applies to this proposal as well.
> > > > >
> > > > > Yes - some systems might be better off with platform tracking.
> > > > > And I think supporting shadow vq better would be nice too.
> > > >
> > > > For shadow vq, did you mean the work that is done by Eugenio?
> > >
> > > Yes.
> > 
> > That's exactly why vDPA starts with shadow virtqueue. We've evaluated various
> > possible approaches, each of them have their shortcomings and shadow
> > virtqueue is the only one that doesn't require any additional hardware features
> > to work in every platform.
> > 
> > >
> > > > >
> > > > > > > Definitely KVM did
> > > > > > > not scan PTEs. It used pagefaults with bit per page and later
> > > > > > > as VM size grew switched to PLM.  This interface is analogous
> > > > > > > to PLM,
> > > > > >
> > > > > > I think you meant PML actually. And it doesn't work like PML. To
> > > > > > behave like PML it needs to
> > > > > >
> > > > > > 1) log buffers were organized as a queue with indices
> > > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs out
> > > > > > of the buffers
> > > > > > 3) device need to send a notification to the driver if it runs
> > > > > > out of the buffer
> > > > > >
> > > > > > I don't see any of the above in this proposal. If we do that it
> > > > > > would be less problematic than what is being proposed here.
> > > > >
> > > > > What is common between this and PML is that you get the addresses
> > > > > directly without scanning megabytes of bitmaps or worse - hundreds
> > > > > of megabytes of page tables.
> > > >
> > > > Yes, it has overhead but this is the method we use for vhost and KVM
> > (earlier).
> > > >
> > > > To me the  important advantage of PML is that it uses limited
> > > > resources on the host which
> > > >
> > > > 1) doesn't require resources in the device
> > > > 2) doesn't scale as the guest memory increases. (but this advantage
> > > > doesn't exist in neither this nor bitmap)
> > >
> > > it seems 2 exactly exists here.
> > 
> > Actually not, Parav said the device needs to reserve sufficient resources in
> > another thread.
> The device resource reservation starts only when the device migration starts.
> i.e. with WRITE_RECORDS_START command of patch 7 in the series.

And now your precious VM can't migrate at all because -ENOSPC.



> > 
> > >
> > >
> > > > >
> > > > > The data structure is different but I don't see why it is critical.
> > > > >
> > > > > I agree that I don't see out of buffers notifications too which
> > > > > implies device has to maintain something like a bitmap internally.
> > > > > Which I guess could be fine but it is not clear to me how large
> > > > > that bitmap has to be. How does the device know? Needs to be addressed.
> > > >
> > > > This is the question I asked Parav in another thread. Using host
> > > > memory as a queue with notification (like PML) might be much better.
> > >
> > > Well if queue is what you want to do you can just do it internally.
> > 
> > Then it's not the proposal here, Parav has explained it in another reply, and as
> > explained it lacks a lot of other facilities.
> > 
> PML is yet another option that requires small pci writes.
> In the current proposal, there are no small PCI writes.
> It is a query interface from the device.
> 
> > > Problem of course is that it might overflow and cause things like
> > > packet drops.
> > 
> > Exactly like PML. So sticking to wire speed should not be a general goal in the
> > context of migration. It can be done if the speed of the migration interface is
> > faster than the virtio device that needs to be migrated.
> May not have to be.
> Speed of page recording should be fast enough.
> It usually improves with subsequent generation.
> > 
> > >
> > >
> > > > >
> > > > >
> > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > >
> > > > > > 1) For many reasons it can neither see nor log via GPA, so this
> > > > > > requires a traversal of the vIOMMU mapping tables by the
> > > > > > hypervisor afterwards, it would be expensive and need
> > > > > > synchronization with the guest modification of the IO page table which
> > looks very hard.
> > > > >
> > > > > vIOMMU is fast enough to be used on data path but not fast enough
> > > > > for dirty tracking?
> > > >
> > > > We set up SPTEs or using nesting offloading where the PTEs could be
> > > > iterated by hardware directly which is fast.
> > >
> > > There's a way to have hardware find dirty PTEs for you quickly?
> > 
> > Scanning PTEs on the host is faster and more secure than scanning guests, that's
> > what I want to say:
> > 
> > 1) the guest page could be swapped out but not the host one.
> > 2) no guest triggerable behavior
> > 
> 
> Device page tracking table to be consulted to flush on mapping change.
> 
> > > I don't know how it's done. Do tell.
> > >
> > >
> > > > This is not the case here where software needs to iterate the IO
> > > > page tables in the guest which could be slow.
> > > >
> > > > > Hard to believe.  If true and you want to speed up vIOMMU then you
> > > > > implement an efficient datastructure for that.
> > > >
> > > > Besides the issue of performance, it's also racy, assuming we are logging
> > IOVA.
> > > >
> > > > 0) device log IOVA
> > > > 1) hypervisor fetches IOVA from log buffer
> > > > 2) guest map IOVA to a new GPA
> > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > >
> > > > Then we lost the old GPA.
> > >
> > > Interesting and a good point.
> > 
> > Note that PML logs at GPA as it works at L1 of EPT.
> > 
> > > And by the way e.g. vhost has the same issue.  You need to flush dirty
> > > tracking info when changing the mappings somehow.
> > 
> > It's not,
> > 
> > 1) memory translation is done by vhost
> > 2) vhost knows GPA and it doesn't log via IOVA.
> > 
> > See this for example, and DPDK has similar fixes.
> > 
> > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > Author: Jason Wang <jasowang@redhat.com>
> > Date:   Wed Jan 16 16:54:42 2019 +0800
> > 
> >     vhost: log dirty page correctly
> > 
> >     Vhost dirty page logging API is designed to sync through GPA. But we
> >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> >     lead to missing data after migration.
> > 
> >     To solve this issue, when logging with device IOTLB enabled, we will:
> > 
> >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> >        get HVA, for writable descriptor, get HVA through iovec. For used
> >        ring update, translate its GIOVA to HVA
> >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> >        through GPA. Pay attention this reverse mapping is not guaranteed
> >        to be unique, so we should log each possible GPA in this case.
> > 
> >     This fix the failure of scp to guest during migration. In -next, we
> >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > 
> >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > 
> > All of the above is not what virtio did right now.
> > 
> > > Parav what's the plan for this? Should be addressed in the spec too.
> > >
> > 
> > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> > 
> 
> The query interface in this proposal works on the granular boundary to read and clear.
> This will ensure that mapping is consistent.

By itself it does not; you have to keep querying until you have flushed
all dirty info, and do so each time there's an invalidation in the
IOMMU.
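
A minimal sketch of that ordering, with purely hypothetical helper names
(neither the query command layout nor the hypervisor interfaces are spelled
out in this thread):

    /* Sketch only: drain (read and clear) the device's write records for an
     * IOVA range while the old IOVA->GPA mapping is still valid, and only
     * then apply the vIOMMU invalidation.  All names are assumptions.
     */
    #include <stdint.h>

    struct dirty_rec { uint64_t iova; uint64_t len; };

    /* assumed helpers, declared but not defined here */
    int      device_query_write_records(void *dev, uint64_t iova, uint64_t len,
                                        struct dirty_rec *rec);
    uint64_t old_mapping_iova_to_gpa(void *vm, uint64_t iova);
    void     migration_bitmap_set(void *vm, uint64_t gpa, uint64_t len);
    void     platform_iommu_unmap(void *domain, uint64_t iova, uint64_t len);

    void viommu_unmap_range(void *vm, void *dev, void *domain,
                            uint64_t iova, uint64_t len)
    {
            struct dirty_rec rec;

            /* harvest dirty info against the pre-unmap IOVA->GPA view */
            while (device_query_write_records(dev, iova, len, &rec) > 0)
                    migration_bitmap_set(vm,
                                         old_mapping_iova_to_gpa(vm, rec.iova),
                                         rec.len);

            /* only then tear down the mapping */
            platform_iommu_unmap(domain, iova, len);
    }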


> > >
> > >
> > > > >
> > > > > > 2) There are a lot of special or reserved IOVA ranges (for
> > > > > > example the interrupt areas in x86) that need special care which
> > > > > > is architectural and where it is beyond the scope or knowledge
> > > > > > of the virtio device but the platform IOMMU. Things would be
> > > > > > more complicated when SVA is enabled.
> > > > >
> > > > > SVA being what here?
> > > >
> > > > For example, IOMMU may treat interrupt ranges differently depending
> > > > on whether SVA is enabled or not. It's very hard and unnecessary to
> > > > teach devices about this.
> > >
> > > Oh, shared virtual memory. So what you are saying here? virtio does
> > > not care, it just uses some addresses and if you want it to it can
> > > record writes somewhere.
> > 
> > One example, PCI allows devices to send translated requests, how can a
> > hypervisor know it's a PA or IOVA in this case? We probably need a new bit. But
> > it's not the only thing we need to deal with.
> > 
> > By definition, interrupt ranges and other reserved ranges should not belong to
> > dirty pages. And the logging should be done before the DMA where there's no
> > way for the device to know whether or not an IOVA is valid or not. It would be
> > more safe to just not report them from the source instead of leaving it to the
> > hypervisor to deal with but this seems impossible at the device level. Otherwise
> > the hypervisor driver needs to communicate with the (v)IOMMU to be reached
> > with the
> > interrupt(MSI) area, RMRR area etc in order to do the correct things or it might
> > have security implications. And those areas don't make sense at L1 when vSVA
> > is enabled. What's more, when vIOMMU could be fully offloaded, there's no
> > easy way to fetch that information.
> > 
> There cannot be logging before the DMA.
> Only requirement is before the mapping changes, the dirty page tracking to be synced.
> 
> In most common cases where the perf is critical, such mapping wont change so often dynamically anyway.
> 
> > Again, it's hard to bypass or even duplicate the functionality of the platform or
> > we need to step into every single detail of a specific transport, architecture or
> > IOMMU to figure out whether or not logging at virtio is correct which is
> > awkward and unrealistic. This proposal suffers from an exact similar issue when
> > inventing things like freeze/stop where I've pointed out other branches of issues
> > as well.
> > 
> It is incorrect attribution that platform is duplicated here.
> It feeds the data to the platform as needed without replicating.
> 
> I do agree that there is overlap of IOMMU tracking the dirty and storing it in the per PTE vs device supplying its dirty track via its own interface.
> Both are consolidated at hypervisor level.
> 
> > >
> > > > >
> > > > > > And there could be other architecte specific knowledge (e.g
> > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal
> > > > > > with those cases.
> > > > >
> > > > > Good point about page size actually - using 4k unconditionally is
> > > > > a waste of resources.
> > > >
> > > > Actually, they are more than just PAGE_SIZE, for example, PASID and others.
> > >
> > > what does pasid have to do with it? anyway, just give driver control
> > > over page size.
> > 
> > For example, two virtqueues have two PASIDs assigned. How can a hypervisor
> > know which specific IOVA belongs to which IOVA? For platform IOMMU, they
> > are handy as it talks to the transport. But I don't think we need to duplicate
> > every transport specific address space feature in core virtio layer:
> > 
> PASID to vq assignment won't be duplicated.
> It is configured fully by the guest without consulting hypervisor at the device level.
> Guest IOMMU would consult hypervisor to setup any PASID mapping as part of any mapping method.
> 
> > 1) translated/untranslated request
> > 2) request w/ and w/o PASID
> > 
> > >
> > > > >
> > > > >
> > > > > > We wouldn't need to care about all of them if it is done at
> > > > > > platform IOMMU level.
> > > > >
> > > > > If someone logs at IOMMU level then nothing needs to be done in
> > > > > the spec at all. This is about capability at the device level.
> > > >
> > > > True, but my question is where or not it can be done at the device level
> > easily.
> > >
> > > there's no "easily" about live migration ever.
> > 
> > I think I've stated sufficient issues to demonstrate how hard virtio wants to do it.
> > And I've given the link that it is possible to do that in IOMMU without those
> > issues. So in this context doing it in virtio is much harder.
> > 
> > > For example on-device iommus are a thing.
> > 
> > I'm not sure that's the way to go considering the platform IOMMU evolves very
> > quickly.
> > 
> > >
> > > > >
> > > > >
> > > > > > > what Lingshan
> > > > > > > proposed is analogous to bit per page - problem unfortunately
> > > > > > > is you can't easily set a bit by DMA.
> > > > > > >
> > > > > >
> > > > > > I'm not saying bit/bytemap is the best, but it has been used by
> > > > > > real hardware. And we have many other options.
> > > > > >
> > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > >
> > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > Because users needs to use it now.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > If not, we are better off to offer this, and when/if
> > > > > > > > > > > platform support is, sure,
> > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > >
> > > > > > > > > > > > 4) if the platform support is missing, we can use
> > > > > > > > > > > > software or leverage transport for assistance like
> > > > > > > > > > > > PRI
> > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > than page fault rate
> > > > > > > > > > done by the cpu.
> > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > >
> > > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > > Do you have perf data for this?
> > > > > > > >
> > > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > > small program that dirty every page by a NIC.
> > > > > > > >
> > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > >
> > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > >
> > > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > > NIC), we can't satisfy the requirement of the downtime. Or
> > > > > > > > if you see the converge, you might get help from the auto
> > > > > > > > converge support by the hypervisors like KVM where it tries
> > > > > > > > to throttle the VCPU then you can't reach the wire speed.
> > > > > > >
> > > > > > > Will only work for some device types.
> > > > > > >
> > > > > >
> > > > > > Yes, that's the point. Parav said he doesn't see the issue, it's
> > > > > > probably because he is testing a virtio-net and so the vCPU is
> > > > > > automatically throttled. It doesn't mean it can work for other
> > > > > > virito devices.
> > > > >
> > > > > Only for TX, and I'm pretty sure they had the foresight to test RX
> > > > > not just TX but let's confirm. Parav did you test both directions?
> > > >
> > > > RX speed somehow depends on the speed of refill, so throttling helps
> > > > more or less.
> > >
> > > It doesn't depend on speed of refill you just underrun and drop
> > > packets. then your nice 10usec latency becomes more like 10sec.
> > 
> > I miss your point here. If the driver can't achieve wire speed without dirty page
> > tracking, it can neither when dirty page tracking is enabled.
> > 
> > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > So it is unusable.
> > > > > > > > > >
> > > > > > > > > > It's not about mandating, it's about doing things in the
> > > > > > > > > > correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > You should try.
> > > > > > > >
> > > > > > > > Not my duty, I just want to make sure things are done in the
> > > > > > > > correct layer, and once it needs to be done in the virtio,
> > > > > > > > there's nothing obviously wrong.
> > > > > > >
> > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > >
> > > > > > I don't think it's vague, I have explained, if something in the
> > > > > > virito slows down the PRI, we can try to fix them.
> > > > >
> > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > >
> > > > So it's the fault of PRI not virito, but it doesn't mean we need to
> > > > do it in virtio.
> > >
> > > I keep saying with this approach we would just say "e1000 emulation is
> > > slow and encumbered this is the fault of e1000" and never get virtio
> > > at all.  Assigning blame only gets you so far.
> > 
> > I think we are discussing different things. My point is virtio needs to leverage
> > the functionality provided by transport or platform (especially considering they
> > evolve faster than virtio). It seems to me it's hard even to duplicate some basic
> > function of platform IOMMU in virtio.
> > 
> Not duplicated. Feeding into the platform.

I mean IOMMU still sets the dirty bit, too. How is that not
a duplication?


> > >
> > > > >
> > > > > > Missing functions in
> > > > > > platform or transport is not a good excuse to try to workaround
> > > > > > it in the virtio. It's a layer violation and we never had any
> > > > > > feature like this in the past.
> > > > >
> > > > > Yes missing functionality in the platform is exactly why virtio
> > > > > was born in the first place.
> > > >
> > > > Well the platform can't do device specific logic. But that's not the
> > > > case of dirty page tracking which is device logic agnostic.
> > >
> > > Not true platforms have things like NICs on board and have for many
> > > years. It's about performance really.
> > 
> > I've stated sufficient issues above. And one more obvious issue for device
> > initiated page logging is that it needs a lot of extra or unnecessary PCI
> > transactions which will throttle the performance of the whole system (and lead
> > to other issues like QOS). So I can't believe it has good performance overall.
> > Logging via IOMMU or using shadow virtqueue doesn't need any extra PCI
> > transactions at least.
> > 
> In the current proposal, it does not required PCI transactions, as there is only a hypervisor-initiated query interface.
> It is a trade off of using svq + pasid vs using something from the device.
> 
> Again, both has different use case and value. One uses cpu and one uses device.
> Depending how much power one wants to spend where..

Also, how much effort do we want to spend on this virtio-specific thing?
There needs to be a *reason* to do things in virtio as opposed to using
platform capabilities; this is exactly the same thing I told Lingshan
wrt using SUSPEND for power management as opposed to using PCI PM -
relying on the platform when we can is right there in the mission statement.
For some reason I assumed you guys had done a PoC and that's the
motivation, but if it's a "just in case" feature then I'd suggest
we focus on merging patches 1-5 first.


> > > So I'd like Parav to publish some
> > > experiment results and/or some estimates.
> > >
> > 
> > That's fine, but the above equation (used by Qemu) is sufficient to demonstrate
> > how hard to stick wire speed in the case.
> > 
> > >
> > > > >
> > > > > > >
> > > > > > > > > In the current state, it is mandating.
> > > > > > > > > And if you think PRI is the only way,
> > > > > > > >
> > > > > > > > I don't, it's just an example where virtio can leverage from
> > > > > > > > either transport or platform. Or if it's the fault in virtio
> > > > > > > > that slows down the PRI, then it is something we can do.
> > > > > > > >
> > > > > > > > >  than you should propose that in the dirty page tracking series that
> > you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > >
> > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > >
> > > > > > > If someone says they tried and platform's migration support
> > > > > > > does not work for them and they want to build a solution in
> > > > > > > virtio then what exactly is the objection?
> > > > > >
> > > > > > The discussion is to make sure whether virtio can do this easily
> > > > > > and correctly, then we can have a conclusion. I've stated some
> > > > > > issues above, and I've asked other questions related to them
> > > > > > which are still not answered.
> > > > > >
> > > > > > I think we had a very hard time in bypassing IOMMU in the past
> > > > > > that we don't want to repeat.
> > > > > >
> > > > > > We've gone through several methods of logging dirty pages in the
> > > > > > past (each with pros/cons), but this proposal never explains why
> > > > > > it chooses one of them but not others. Spec needs to find the
> > > > > > best path instead of just a possible path without any rationale about
> > why.
> > > > >
> > > > > Adding more rationale isn't a bad thing.
> > > > > In particular if platform supplies dirty tracking then how does
> > > > > driver decide which to use platform or device capability?
> > > > > A bit of discussion around this is a good idea.
> > > > >
> > > > >
> > > > > > > virtio is here in the
> > > > > > > first place because emulating devices didn't work well.
> > > > > >
> > > > > > I don't understand here. We have supported emulated devices for years.
> > > > > > I'm pretty sure a lot of issues could be uncovered if this
> > > > > > proposal can be prototyped with an emulated device first.
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > virtio was originally PV as opposed to emulation. That there's now
> > > > > hardware virtio and you call software implementation "an
> > > > > emulation" is very meta.
> > > >
> > > > Yes but I don't see how it relates to dirty page tracking. When we
> > > > find a way it should work for both software and hardware devices.
> > > >
> > > > Thanks
> > >
> > > It has to work well on a variety of existing platforms. If it does
> > > then sure, why would we roll our own.
> > 
> > If virtio can do that in an efficient way without any issues, I agree.
> > But it seems not.
> > 
> > Thanks
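
Plugging illustrative numbers into the downtime formula quoted earlier in
this message (downtime = dirty_rate * PAGE_SIZE / migration_speed); the
figures below are examples, not measurements:

    /* Example only: a device dirtying 250k 4 KiB pages/s against a ~10 Gbps
     * migration link.  If dirty bytes/s approaches link bytes/s, pre-copy
     * cannot converge without throttling, whatever the tracking mechanism.
     */
    #include <stdio.h>

    int main(void)
    {
            double page_size       = 4096.0;    /* bytes */
            double dirty_rate      = 250000.0;  /* pages dirtied per second */
            double migration_speed = 1.25e9;    /* bytes per second (~10 Gbps) */

            double dirty_bps = dirty_rate * page_size;

            printf("dirtied %.2f GB/s vs copied %.2f GB/s\n",
                   dirty_bps / 1e9, migration_speed / 1e9);
            printf("residual downtime ~= %.2f s per final pass\n",
                   dirty_bps / migration_speed);
            return 0;
    }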





* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:57                                                           ` Parav Pandit
@ 2023-11-17 10:37                                                             ` Michael S. Tsirkin
  2023-11-17 10:52                                                               ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 10:37 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 09:57:52AM +0000, Parav Pandit wrote:
> 
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Michael S. Tsirkin
> > Sent: Friday, November 17, 2023 3:21 PM
> > 
> > On Fri, Nov 17, 2023 at 09:41:40AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 3:08 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 2:16 PM In any case you can
> > > > > > safely assume that many users will have migration that takes
> > > > > > seconds and minutes.
> > > > >
> > > > > Strange, but ok. I don't see any problem with current method.
> > > > > 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> > > >
> > > > The problem is simple: vendors selling devices have no idea how
> > > > large the VM will be. So you have to over-provision for the max VM size.
> > > > If there was a way to instead allocate that in host memory, that
> > > > would improve on this.
> > >
> > > Not sure what to over provision for max VM size.
> > > Vendor does not know how many vcpus will be needed. It is no different
> > problem.
> > >
> > > When the VM migration is started, the individual tracking range is supplied by
> > the hypervisor to device.
> > > Device allocates necessary memory on this instruction.
> > >
> > > When the VM with certain size is provisioned, the member device can be
> > provisioned for the VM size.
> > > And if it cannot be provisioned, possibly this may not the right member device
> > to use at that point in time.
> > 
> > For someone who keeps arguing against adding single bit registers "because it
> > does not scale" you seem very nonchalant about adding 8Mbytes.
> > 
> There is fundamental difference on how/when a bit is used.
> One wants to use a bit for non-performance part and keep it always available vs data path.
> Not same comparison.
> 
> > I thought we have a nicely contained and orthogonal feature, so if it's optional
> > it's not a problem.
> It is optional as always.
> 
> > 
> > But with such costs and corner cases what exactly is the motivation for the
> > feature here?  
> New generations DPUs have memory for device data path workloads but not for bits.
> 
> > Do you have a PoC showing how this works better than e.g.
> > shadow VQ?
> > 
> Not yet.
> But I don't think this can be even a criteria to consider as dependency on PASID is nonstarter with other limitations.

You just need the dirty bit in the PTE; whether that is tied to PASID depends
very much on the platform.  For VT-d I think it is.  And if shadow vq
works as a fallback, it just might be reasonable not to do any tracking
in virtio.
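
One possible precedence a driver could apply, purely as an illustration of
the choice being discussed (the capability checks are hypothetical, and the
ordering itself is an open question in this thread):

    /* Sketch: pick a dirty-tracking method, preferring the platform when it
     * can do the job and falling back to device records or shadow vq.
     */
    enum dirty_track_method {
            TRACK_PLATFORM_IOMMU,   /* dirty bit in IOMMU PTEs */
            TRACK_DEVICE_RECORDS,   /* write-records commands in this series */
            TRACK_SHADOW_VQ,        /* software fallback, no hw dependency */
    };

    enum dirty_track_method pick_dirty_tracking(int iommu_has_dirty_bit,
                                                int device_has_write_records)
    {
            if (iommu_has_dirty_bit)
                    return TRACK_PLATFORM_IOMMU;
            if (device_has_write_records)
                    return TRACK_DEVICE_RECORDS;
            return TRACK_SHADOW_VQ;
    }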

> > Maybe IOMMU based and shadow VQ based tracking are the way to go initially,
> > and if there's a problem then we should add this later, on top.
> >
> For the cpus that does not support IOMMU cannot shift to shadow VQ either.

I don't know what this means (no IOMMU at all?) but it looks like shadow
vq and similar approaches are in production with vdpa and have been
demonstrated for a while. All we are doing is supporting them in
virtio proper.

> > I really want us to finally make progress merging features and anything that
> > reduces scope initially is good for that.
> >
> Yes, if you prefer to split the last three patches, I am fine.
> Please let me know.

As there have not been any comments on 1-5, I don't think there's
a need to repost this just yet. I'll review 1-5 next week.
I think in the next version it might be wise to split this and post
it as two series, yes.

-- 
MST





* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  9:59                                     ` Zhu, Lingshan
  2023-11-17 10:03                                       ` Parav Pandit
@ 2023-11-17 10:40                                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 10:40 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 05:59:35PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > We should expose a limit of the device in the proposed WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > > > So that future provisioning framework can use it.
> > > > > 
> > > > > I will cover this in v5 early next week.
> > > > I do worry about how this can even work though. If you want a generic
> > > > device you do not get to dictate how much memory VM has.
> > > > 
> > > > Aren't we talking bit per page? With 1TByte of memory to track ->
> > > > 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > 
> > > > And you happily say "we'll address this in the future" while at the same
> > > > time fighting tooth and nail against adding single bit status registers
> > > > because scalability?
> > > > 
> > > > 
> > > > I have a feeling doing this completely theoretical like this is problematic.
> > > > Maybe you have it all laid out neatly in your head but I suspect
> > > > not all of TC can picture it clearly enough based just on spec text.
> > > > 
> > > > We do sometimes ask for POC implementation in linux / qemu to
> > > > demonstrate how things work before merging code. We skipped this
> > > > for admin things so far but I think it's a good idea to start doing
> > > > it here.
> > > > 
> > > > What makes me pause a bit before saying please do a PoC is
> > > > all the opposition that seems to exist to even using admin
> > > > commands in the 1st place. I think once we finally stop
> > > > arguing about whether to use admin commands at all then
> > > > a PoC will be needed before merging.
> > > We have POR productions that implemented the approach in my series. They are
> > > multiple generations
> > > of productions in market and running in customers data centers for years.
> > > 
> > > Back to 2019 when we start working on vDPA, we have sent some samples of
> > > production(e.g., Cascade Glacier)
> > > and the datasheet, you can find live migration facilities there, includes
> > > suspend, vq state and other
> > > features.
> > > 
> > > And there is an reference in DPDK live migration, I have provided this page
> > > before:
> > > https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it has been working for
> > > long long time.
> > > 
> > > So if we let the facts speak, if we want to see if the proposal is proven to
> > > work, I would
> > > say: They are POR for years, customers already deployed them for years.
> > And I guess what you are trying to say is that this patchset
> > we are reviewing here should be help to the same standard and
> > there should be a PoC? Sounds reasonable.
> Yes and the in-marketing productions are POR, the series just improves the
> design,
> for example, our series also use registers to track vq state, but
> improvements
> than CG or BSC. So I think they are proven to work.

Well yes and no. It works for vdpa because it's a very specific device
with very specific behaviour. If it needs to work for virtio generally,
then 16 bits of state won't be enough so registers won't work.


> > 
> > > For dirty page tracking, I see you want both platform IOMMU tracking and
> > > shadow vqs, I am
> > > totally fine with this idea. And I think maybe we should merge the basic
> > > features first, and
> > > dirty page tracking should be the second step.
> > > 
> > > Thanks
> > Parav wants to add an option of on-device tracking. Which also seems
> > fine. I think it should be optional though because shadow and IOMMU
> > options exist.
> I agree, the vendor can choose to implement their own facility as a backup.
> > 

No, that is a bad idea: if a vendor is doing full virtio, things need to be in the spec.

-- 
MST





* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 10:15                                   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-17 10:48                                     ` Parav Pandit
  2023-11-17 11:19                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 10:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Friday, November 17, 2023 3:46 PM
> 
> On Wed, Nov 15, 2023 at 05:42:04PM +0000, Parav Pandit wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:02 AM
> > >
> > > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > > >
> > > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin
> > > > > <mst@redhat.com>
> > > wrote:
> > > > > >
> > > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin
> > > > > > > <mst@redhat.com>
> > > wrote:
> > > > > > > >
> > > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > > Each virtio and non virtio devices who wants to
> > > > > > > > > > > > report their dirty page report,
> > > > > > > > > > > will do their way.
> > > > > > > > > > > >
> > > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > > deprecated in the future for sure, as platform
> > > > > > > > > > > > > will provide much rich features for logging e.g
> > > > > > > > > > > > > it can do it per PASID etc, I don't see any
> > > > > > > > > > > > > reason virtio need to compete with the features
> > > > > > > > > > > > > that will be provided by the platform
> > > > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > > > virtio tc with timelines
> > > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > > >
> > > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > > Virtio needs to be built on top of transport or
> > > > > > > > > > > platform. There's
> > > no need to duplicate their job.
> > > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > > >
> > > > > > > > > > I wanted to see a strong commitment for the cpu
> > > > > > > > > > vendors to
> > > support dirty page tracking.
> > > > > > > > >
> > > > > > > > > The RFC of IOMMUFD support can go back to early 2022.
> > > > > > > > > Intel, AMD and ARM are all supporting that now.
> > > > > > > > >
> > > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > > >
> > > > > > > > > Let me quote from the above link:
> > > > > > > > >
> > > > > > > > > """
> > > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > > """
> > > > > > > > >
> > > > > > > > > > Without such platform commitment, virtio also skipping
> > > > > > > > > > it would
> > > not work.
> > > > > > > > >
> > > > > > > > > Is the above sufficient? I'm a little bit more familiar
> > > > > > > > > with vtd, the hw feature has been there for years.
> > > > > > > >
> > > > > > > >
> > > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > > >
> > > > > > > I think this comment applies to this proposal as well.
> > > > > >
> > > > > > Yes - some systems might be better off with platform tracking.
> > > > > > And I think supporting shadow vq better would be nice too.
> > > > >
> > > > > For shadow vq, did you mean the work that is done by Eugenio?
> > > >
> > > > Yes.
> > >
> > > That's exactly why vDPA starts with shadow virtqueue. We've
> > > evaluated various possible approaches, each of them have their
> > > shortcomings and shadow virtqueue is the only one that doesn't
> > > require any additional hardware features to work in every platform.
> > >
> > > >
> > > > > >
> > > > > > > > Definitely KVM did
> > > > > > > > not scan PTEs. It used pagefaults with bit per page and
> > > > > > > > later as VM size grew switched to PLM.  This interface is
> > > > > > > > analogous to PLM,
> > > > > > >
> > > > > > > I think you meant PML actually. And it doesn't work like
> > > > > > > PML. To behave like PML it needs to
> > > > > > >
> > > > > > > 1) log buffers were organized as a queue with indices
> > > > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs
> > > > > > > out of the buffers
> > > > > > > 3) device need to send a notification to the driver if it
> > > > > > > runs out of the buffer
> > > > > > >
> > > > > > > I don't see any of the above in this proposal. If we do that
> > > > > > > it would be less problematic than what is being proposed here.
> > > > > >
> > > > > > What is common between this and PML is that you get the
> > > > > > addresses directly without scanning megabytes of bitmaps or
> > > > > > worse - hundreds of megabytes of page tables.
> > > > >
> > > > > Yes, it has overhead but this is the method we use for vhost and
> > > > > KVM
> > > (earlier).
> > > > >
> > > > > To me the  important advantage of PML is that it uses limited
> > > > > resources on the host which
> > > > >
> > > > > 1) doesn't require resources in the device
> > > > > 2) doesn't scale as the guest memory increases. (but this
> > > > > advantage doesn't exist in neither this nor bitmap)
> > > >
> > > > it seems 2 exactly exists here.
> > >
> > > Actually not, Parav said the device needs to reserve sufficient
> > > resources in another thread.
> > The device resource reservation starts only when the device migration starts.
> > i.e. with WRITE_RECORDS_START command of patch 7 in the series.
> 
> And now your precious VM can't migrate at all because -ENOSPC.
>
I am not aware of any Linux IOCTL that guarantees execution without ever returning an error code. :)

As we discussed in the other email, a VF can also be provisioned as an extension, and the capability can be exposed.
This is not going to be the only possible error during device migration.
 
> 
> 
> > >
> > > >
> > > >
> > > > > >
> > > > > > The data structure is different but I don't see why it is critical.
> > > > > >
> > > > > > I agree that I don't see out of buffers notifications too
> > > > > > which implies device has to maintain something like a bitmap internally.
> > > > > > Which I guess could be fine but it is not clear to me how
> > > > > > large that bitmap has to be. How does the device know? Needs to be
> addressed.
> > > > >
> > > > > This is the question I asked Parav in another thread. Using host
> > > > > memory as a queue with notification (like PML) might be much better.
> > > >
> > > > Well if queue is what you want to do you can just do it internally.
> > >
> > > Then it's not the proposal here, Parav has explained it in another
> > > reply, and as explained it lacks a lot of other facilities.
> > >
> > PML is yet another option that requires small pci writes.
> > In the current proposal, there are no small PCI writes.
> > It is a query interface from the device.
> >
> > > > Problem of course is that it might overflow and cause things like
> > > > packet drops.
> > >
> > > Exactly like PML. So sticking to wire speed should not be a general
> > > goal in the context of migration. It can be done if the speed of the
> > > migration interface is faster than the virtio device that needs to be migrated.
> > May not have to be.
> > Speed of page recording should be fast enough.
> > It usually improves with subsequent generation.
> > >
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > > >
> > > > > > > 1) For many reasons it can neither see nor log via GPA, so
> > > > > > > this requires a traversal of the vIOMMU mapping tables by
> > > > > > > the hypervisor afterwards, it would be expensive and need
> > > > > > > synchronization with the guest modification of the IO page
> > > > > > > table which
> > > looks very hard.
> > > > > >
> > > > > > vIOMMU is fast enough to be used on data path but not fast
> > > > > > enough for dirty tracking?
> > > > >
> > > > > We set up SPTEs or using nesting offloading where the PTEs could
> > > > > be iterated by hardware directly which is fast.
> > > >
> > > > There's a way to have hardware find dirty PTEs for you quickly?
> > >
> > > Scanning PTEs on the host is faster and more secure than scanning
> > > guests, that's what I want to say:
> > >
> > > 1) the guest page could be swapped out but not the host one.
> > > 2) no guest triggerable behavior
> > >
> >
> > Device page tracking table to be consulted to flush on mapping change.
> >
> > > > I don't know how it's done. Do tell.
> > > >
> > > >
> > > > > This is not the case here where software needs to iterate the IO
> > > > > page tables in the guest which could be slow.
> > > > >
> > > > > > Hard to believe.  If true and you want to speed up vIOMMU then
> > > > > > you implement an efficient datastructure for that.
> > > > >
> > > > > Besides the issue of performance, it's also racy, assuming we
> > > > > are logging
> > > IOVA.
> > > > >
> > > > > 0) device log IOVA
> > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > 2) guest map IOVA to a new GPA
> > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > >
> > > > > Then we lost the old GPA.
> > > >
> > > > Interesting and a good point.
> > >
> > > Note that PML logs at GPA as it works at L1 of EPT.
> > >
> > > > And by the way e.g. vhost has the same issue.  You need to flush
> > > > dirty tracking info when changing the mappings somehow.
> > >
> > > It's not,
> > >
> > > 1) memory translation is done by vhost
> > > 2) vhost knows GPA and it doesn't log via IOVA.
> > >
> > > See this for example, and DPDK has similar fixes.
> > >
> > > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > > Author: Jason Wang <jasowang@redhat.com>
> > > Date:   Wed Jan 16 16:54:42 2019 +0800
> > >
> > >     vhost: log dirty page correctly
> > >
> > >     Vhost dirty page logging API is designed to sync through GPA. But we
> > >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > >     lead to missing data after migration.
> > >
> > >     To solve this issue, when logging with device IOTLB enabled, we will:
> > >
> > >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > >        get HVA, for writable descriptor, get HVA through iovec. For used
> > >        ring update, translate its GIOVA to HVA
> > >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> > >        through GPA. Pay attention this reverse mapping is not guaranteed
> > >        to be unique, so we should log each possible GPA in this case.
> > >
> > >     This fix the failure of scp to guest during migration. In -next, we
> > >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > >
> > >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> > >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> > >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > All of the above is not what virtio did right now.
> > >
> > > > Parav what's the plan for this? Should be addressed in the spec too.
> > > >
> > >
> > > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> > >
> >
> > The query interface in this proposal works on the granular boundary to read
> and clear.
> > This will ensure that mapping is consistent.
> 
> By itself it does not, you have to actually keep querying until you flush all dirty
> info and do it each time there's an invalidation in the IOMMU.
>
Only during device migration.
It only applies in the specific case where an unmap and a migration are in progress at the same time.
But yes, it can slow down unmapping.
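
A minimal sketch of that ordering (illustrative Python only; the class and
method names are hypothetical and assume the device records page-aligned
IOVAs, this is not spec or driver code):

PAGE_SIZE = 4096

class MemberDevice:
    def __init__(self):
        self.write_records = set()        # page-aligned IOVAs written by the device

    def read_and_clear_writes(self, iova, length):
        hit = {a for a in self.write_records if iova <= a < iova + length}
        self.write_records -= hit         # "read and clear" semantics of the query
        return hit

class Hypervisor:
    def __init__(self, device):
        self.device = device
        self.iova_to_gpa = {}             # current vIOMMU mappings
        self.dirty_gpas = set()
        self.migration_in_progress = True

    def unmap(self, iova, length):
        if self.migration_in_progress:
            # Sync device-recorded writes while the IOVA->GPA translation still exists.
            for dirty_iova in self.device.read_and_clear_writes(iova, length):
                gpa = self.iova_to_gpa.get(dirty_iova)
                if gpa is not None:
                    self.dirty_gpas.add(gpa)
        # Only now drop the mapping; the IOTLB invalidation would follow here.
        for addr in range(iova, iova + length, PAGE_SIZE):
            self.iova_to_gpa.pop(addr, None)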
 
> 
> > > >
> > > >
> > > > > >
> > > > > > > 2) There are a lot of special or reserved IOVA ranges (for
> > > > > > > example the interrupt areas in x86) that need special care
> > > > > > > which is architectural and where it is beyond the scope or
> > > > > > > knowledge of the virtio device but the platform IOMMU.
> > > > > > > Things would be more complicated when SVA is enabled.
> > > > > >
> > > > > > SVA being what here?
> > > > >
> > > > > For example, IOMMU may treat interrupt ranges differently
> > > > > depending on whether SVA is enabled or not. It's very hard and
> > > > > unnecessary to teach devices about this.
> > > >
> > > > Oh, shared virtual memory. So what you are saying here? virtio
> > > > does not care, it just uses some addresses and if you want it to
> > > > it can record writes somewhere.
> > >
> > > One example, PCI allows devices to send translated requests, how can
> > > a hypervisor know it's a PA or IOVA in this case? We probably need a
> > > new bit. But it's not the only thing we need to deal with.
> > >
> > > By definition, interrupt ranges and other reserved ranges should not
> > > belong to dirty pages. And the logging should be done before the DMA
> > > where there's no way for the device to know whether or not an IOVA
> > > is valid or not. It would be more safe to just not report them from
> > > the source instead of leaving it to the hypervisor to deal with but
> > > this seems impossible at the device level. Otherwise the hypervisor
> > > driver needs to communicate with the (v)IOMMU to be reached with the
> > > interrupt(MSI) area, RMRR area etc in order to do the correct things
> > > or it might have security implications. And those areas don't make
> > > sense at L1 when vSVA is enabled. What's more, when vIOMMU could be
> > > fully offloaded, there's no easy way to fetch that information.
> > >
> > There cannot be logging before the DMA.
> > Only requirement is before the mapping changes, the dirty page tracking to be
> synced.
> >
> > In most common cases where the perf is critical, such mapping wont change
> so often dynamically anyway.
> >
> > > Again, it's hard to bypass or even duplicate the functionality of
> > > the platform or we need to step into every single detail of a
> > > specific transport, architecture or IOMMU to figure out whether or
> > > not logging at virtio is correct which is awkward and unrealistic.
> > > This proposal suffers from an exact similar issue when inventing
> > > things like freeze/stop where I've pointed out other branches of issues as
> well.
> > >
> > It is incorrect attribution that platform is duplicated here.
> > It feeds the data to the platform as needed without replicating.
> >
> > I do agree that there is overlap of IOMMU tracking the dirty and storing it in
> the per PTE vs device supplying its dirty track via its own interface.
> > Both are consolidated at hypervisor level.
> >
> > > >
> > > > > >
> > > > > > > And there could be other architecte specific knowledge (e.g
> > > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal
> > > > > > > with those cases.
> > > > > >
> > > > > > Good point about page size actually - using 4k unconditionally
> > > > > > is a waste of resources.
> > > > >
> > > > > Actually, they are more than just PAGE_SIZE, for example, PASID and
> others.
> > > >
> > > > what does pasid have to do with it? anyway, just give driver
> > > > control over page size.
> > >
> > > For example, two virtqueues have two PASIDs assigned. How can a
> > > hypervisor know which specific IOVA belongs to which IOVA? For
> > > platform IOMMU, they are handy as it talks to the transport. But I
> > > don't think we need to duplicate every transport specific address space
> feature in core virtio layer:
> > >
> > PASID to vq assignment won't be duplicated.
> > It is configured fully by the guest without consulting hypervisor at the device
> level.
> > Guest IOMMU would consult hypervisor to setup any PASID mapping as part
> of any mapping method.
> >
> > > 1) translated/untranslated request
> > > 2) request w/ and w/o PASID
> > >
> > > >
> > > > > >
> > > > > >
> > > > > > > We wouldn't need to care about all of them if it is done at
> > > > > > > platform IOMMU level.
> > > > > >
> > > > > > If someone logs at IOMMU level then nothing needs to be done
> > > > > > in the spec at all. This is about capability at the device level.
> > > > >
> > > > > True, but my question is where or not it can be done at the
> > > > > device level
> > > easily.
> > > >
> > > > there's no "easily" about live migration ever.
> > >
> > > I think I've stated sufficient issues to demonstrate how hard virtio wants to
> do it.
> > > And I've given the link that it is possible to do that in IOMMU
> > > without those issues. So in this context doing it in virtio is much harder.
> > >
> > > > For example on-device iommus are a thing.
> > >
> > > I'm not sure that's the way to go considering the platform IOMMU
> > > evolves very quickly.
> > >
> > > >
> > > > > >
> > > > > >
> > > > > > > > what Lingshan
> > > > > > > > proposed is analogous to bit per page - problem
> > > > > > > > unfortunately is you can't easily set a bit by DMA.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not saying bit/bytemap is the best, but it has been used
> > > > > > > by real hardware. And we have many other options.
> > > > > > >
> > > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > >
> > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > Because users needs to use it now.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > If not, we are better off to offer this, and
> > > > > > > > > > > > when/if platform support is, sure,
> > > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > > >
> > > > > > > > > > > > > 4) if the platform support is missing, we can
> > > > > > > > > > > > > use software or leverage transport for
> > > > > > > > > > > > > assistance like PRI
> > > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > > than page fault rate
> > > > > > > > > > > done by the cpu.
> > > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > >
> > > > > > > > > > > If you stick to the wire speed during migration, it can
> converge.
> > > > > > > > > > Do you have perf data for this?
> > > > > > > > >
> > > > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > > > small program that dirty every page by a NIC.
> > > > > > > > >
> > > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > > >
> > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > >
> > > > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > > > NIC), we can't satisfy the requirement of the downtime.
> > > > > > > > > Or if you see the converge, you might get help from the
> > > > > > > > > auto converge support by the hypervisors like KVM where
> > > > > > > > > it tries to throttle the VCPU then you can't reach the wire speed.
> > > > > > > >
> > > > > > > > Will only work for some device types.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, that's the point. Parav said he doesn't see the issue,
> > > > > > > it's probably because he is testing a virtio-net and so the
> > > > > > > vCPU is automatically throttled. It doesn't mean it can work
> > > > > > > for other virito devices.
> > > > > >
> > > > > > Only for TX, and I'm pretty sure they had the foresight to
> > > > > > test RX not just TX but let's confirm. Parav did you test both directions?
> > > > >
> > > > > RX speed somehow depends on the speed of refill, so throttling
> > > > > helps more or less.
> > > >
> > > > It doesn't depend on speed of refill you just underrun and drop
> > > > packets. then your nice 10usec latency becomes more like 10sec.
> > >
> > > I miss your point here. If the driver can't achieve wire speed
> > > without dirty page tracking, it can neither when dirty page tracking is
> enabled.
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > So it is unusable.
> > > > > > > > > > >
> > > > > > > > > > > It's not about mandating, it's about doing things in
> > > > > > > > > > > the correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > > You should try.
> > > > > > > > >
> > > > > > > > > Not my duty, I just want to make sure things are done in
> > > > > > > > > the correct layer, and once it needs to be done in the
> > > > > > > > > virtio, there's nothing obviously wrong.
> > > > > > > >
> > > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > > >
> > > > > > > I don't think it's vague, I have explained, if something in
> > > > > > > the virito slows down the PRI, we can try to fix them.
> > > > > >
> > > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > > >
> > > > > So it's the fault of PRI not virito, but it doesn't mean we need
> > > > > to do it in virtio.
> > > >
> > > > I keep saying with this approach we would just say "e1000
> > > > emulation is slow and encumbered this is the fault of e1000" and
> > > > never get virtio at all.  Assigning blame only gets you so far.
> > >
> > > I think we are discussing different things. My point is virtio needs
> > > to leverage the functionality provided by transport or platform
> > > (especially considering they evolve faster than virtio). It seems to
> > > me it's hard even to duplicate some basic function of platform IOMMU in
> virtio.
> > >
> > Not duplicated. Feeding into the platform.
> 
> I mean IOMMU still sets the dirty bit, too. How is that not a duplication?
>
Only if the IOMMU is enabled for it.
For example, AMD has the DTE HAD bit to enable dirty page tracking in the IOMMU.

So if the platform does not enable it, it can be enabled on the device, and vice versa.
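
As a rough sketch of how a hypervisor could choose between the two (a purely
illustrative policy in Python; none of these names come from the spec):

def pick_dirty_tracker(iommu_supports_dirty_bit, device_supports_write_records):
    if iommu_supports_dirty_bit:
        return "platform-iommu"          # e.g. HAD bit set in the DTE on AMD
    if device_supports_write_records:
        return "device-write-records"    # the admin commands proposed in this series
    return "shadow-vq"                   # software fallback, no hardware dependency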

> 
> > > >
> > > > > >
> > > > > > > Missing functions in
> > > > > > > platform or transport is not a good excuse to try to
> > > > > > > workaround it in the virtio. It's a layer violation and we
> > > > > > > never had any feature like this in the past.
> > > > > >
> > > > > > Yes missing functionality in the platform is exactly why
> > > > > > virtio was born in the first place.
> > > > >
> > > > > Well the platform can't do device specific logic. But that's not
> > > > > the case of dirty page tracking which is device logic agnostic.
> > > >
> > > > Not true platforms have things like NICs on board and have for
> > > > many years. It's about performance really.
> > >
> > > I've stated sufficient issues above. And one more obvious issue for
> > > device initiated page logging is that it needs a lot of extra or
> > > unnecessary PCI transactions which will throttle the performance of
> > > the whole system (and lead to other issues like QOS). So I can't believe it has
> good performance overall.
> > > Logging via IOMMU or using shadow virtqueue doesn't need any extra
> > > PCI transactions at least.
> > >
> > In the current proposal, it does not required PCI transactions, as there is only a
> hypervisor-initiated query interface.
> > It is a trade off of using svq + pasid vs using something from the device.
> >
> > Again, both has different use case and value. One uses cpu and one uses
> device.
> > Depending how much power one wants to spend where..
> 
> Also how much effort we want to spend on this virtio specific thing.
> The needs to be a *reason* to do things in virtio as opposed to using platform
> capabilities, this is exactly the same thing I told Lingshan wrt using SUSPEND for
> power management as opposed to using PCI PM - relying on platform when we
> can is right there in the mission statement.
> For some reason I asssumed you guys have done a PoC and that's the
> motivation but if it's a "just in case" feature then I'd suggest we focus on
> merging patches 1-5 first.
>
It is not a just-in-case feature.
We learned that not all CPUs have it.

There are ongoing efforts on the PoC.
We will have the results in some time.

We already have a similar interface on at least two devices, integrated into the Linux stack; one is upstream, the other is in progress.
virtio is also under discussion here.

Sure, it is proposed as optional. We can focus on 1-5 first.
I will split the series once I have comments.

There is also an extension after 1-5 for the net device context as well.

 
> 
> > > > So I'd like Parav to publish some
> > > > experiment results and/or some estimates.
> > > >
> > >
> > > That's fine, but the above equation (used by Qemu) is sufficient to
> > > demonstrate how hard to stick wire speed in the case.
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > In the current state, it is mandating.
> > > > > > > > > > And if you think PRI is the only way,
> > > > > > > > >
> > > > > > > > > I don't, it's just an example where virtio can leverage
> > > > > > > > > from either transport or platform. Or if it's the fault
> > > > > > > > > in virtio that slows down the PRI, then it is something we can do.
> > > > > > > > >
> > > > > > > > > >  than you should propose that in the dirty page
> > > > > > > > > > tracking series that
> > > you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > > >
> > > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > > >
> > > > > > > > If someone says they tried and platform's migration
> > > > > > > > support does not work for them and they want to build a
> > > > > > > > solution in virtio then what exactly is the objection?
> > > > > > >
> > > > > > > The discussion is to make sure whether virtio can do this
> > > > > > > easily and correctly, then we can have a conclusion. I've
> > > > > > > stated some issues above, and I've asked other questions
> > > > > > > related to them which are still not answered.
> > > > > > >
> > > > > > > I think we had a very hard time in bypassing IOMMU in the
> > > > > > > past that we don't want to repeat.
> > > > > > >
> > > > > > > We've gone through several methods of logging dirty pages in
> > > > > > > the past (each with pros/cons), but this proposal never
> > > > > > > explains why it chooses one of them but not others. Spec
> > > > > > > needs to find the best path instead of just a possible path
> > > > > > > without any rationale about
> > > why.
> > > > > >
> > > > > > Adding more rationale isn't a bad thing.
> > > > > > In particular if platform supplies dirty tracking then how
> > > > > > does driver decide which to use platform or device capability?
> > > > > > A bit of discussion around this is a good idea.
> > > > > >
> > > > > >
> > > > > > > > virtio is here in the
> > > > > > > > first place because emulating devices didn't work well.
> > > > > > >
> > > > > > > I don't understand here. We have supported emulated devices for
> years.
> > > > > > > I'm pretty sure a lot of issues could be uncovered if this
> > > > > > > proposal can be prototyped with an emulated device first.
> > > > > > >
> > > > > > > Thanks
> > > > > >
> > > > > > virtio was originally PV as opposed to emulation. That there's
> > > > > > now hardware virtio and you call software implementation "an
> > > > > > emulation" is very meta.
> > > > >
> > > > > Yes but I don't see how it relates to dirty page tracking. When
> > > > > we find a way it should work for both software and hardware devices.
> > > > >
> > > > > Thanks
> > > >
> > > > It has to work well on a variety of existing platforms. If it does
> > > > then sure, why would we roll our own.
> > >
> > > If virtio can do that in an efficient way without any issues, I agree.
> > > But it seems not.
> > >
> > > Thanks


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 10:37                                                             ` Michael S. Tsirkin
@ 2023-11-17 10:52                                                               ` Parav Pandit
  2023-11-17 11:32                                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 10:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 4:08 PM
> 
> On Fri, Nov 17, 2023 at 09:57:52AM +0000, Parav Pandit wrote:
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis- open.org> On Behalf Of Michael S.
> > > Tsirkin
> > > Sent: Friday, November 17, 2023 3:21 PM
> > >
> > > On Fri, Nov 17, 2023 at 09:41:40AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 3:08 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 2:16 PM In any case you can
> > > > > > > safely assume that many users will have migration that takes
> > > > > > > seconds and minutes.
> > > > > >
> > > > > > Strange, but ok. I don't see any problem with current method.
> > > > > > 8MB is used for very large VM of 1TB takes minutes. Should be fine.
> > > > >
> > > > > The problem is simple: vendors selling devices have no idea how
> > > > > large the VM will be. So you have to over-provision for the max VM size.
> > > > > If there was a way to instead allocate that in host memory, that
> > > > > would improve on this.
> > > >
> > > > Not sure what to over provision for max VM size.
> > > > Vendor does not know how many vcpus will be needed. It is no
> > > > different
> > > problem.
> > > >
> > > > When the VM migration is started, the individual tracking range is
> > > > supplied by
> > > the hypervisor to device.
> > > > Device allocates necessary memory on this instruction.
> > > >
> > > > When the VM with certain size is provisioned, the member device
> > > > can be
> > > provisioned for the VM size.
> > > > And if it cannot be provisioned, possibly this may not the right
> > > > member device
> > > to use at that point in time.
> > >
> > > For someone who keeps arguing against adding single bit registers
> > > "because it does not scale" you seem very nonchalant about adding
> 8Mbytes.
> > >
> > There is fundamental difference on how/when a bit is used.
> > One wants to use a bit for non-performance part and keep it always available
> vs data path.
> > Not same comparison.
> >
> > > I thought we have a nicely contained and orthogonal feature, so if
> > > it's optional it's not a problem.
> > It is optional as always.
> >
> > >
> > > But with such costs and corner cases what exactly is the motivation
> > > for the feature here?
> > New generations DPUs have memory for device data path workloads but not
> for bits.
> >
> > > Do you have a PoC showing how this works better than e.g.
> > > shadow VQ?
> > >
> > Not yet.
> > But I don't think this can be even a criteria to consider as dependency on
> PASID is nonstarter with other limitations.
> 
> You just need dirty bit in PTE, whether that is tied to PASID depends very much
> on the platform.  For VTD I think it is.  And if shadow vq works as a fallback, it
> just might be reasonable not to do any tracking in virtio.
>
What I don't agree with is the claim that shadow vq is great when no performance numbers have been shared.

And it fundamentally does not fit the generic stack where virtio is to be used.

We have accelerated some of the shadow vq work for non-virtio devices, and those optimizations are not elegant enough that I would want to bring them to the virtio spec.
But that is a different discussion.
 
> > > Maybe IOMMU based and shadow VQ based tracking are the way to go
> > > initially, and if there's a problem then we should add this later, on top.
> > >
> > For the cpus that does not support IOMMU cannot shift to shadow VQ either.
> 
> I don't know what this means (no IOMMU at all?) but it looks like shadow vq
> and similar approaches are in production with vdpa and have been
> demonstrated for a while. All we are doing is supporting them in virtio proper.
> 
The IOMMU is present but does not support the D (dirty) bit.

> > > I really want us to finally make progress merging features and
> > > anything that reduces scope initially is good for that.
> > >
> > Yes, if you prefer to split the last three patches, I am fine.
> > Please let me know.
> 
> As here have not been any comments on 1-5 I don't think there's need to repost
> this just yet. I'll review 1-5 next week.
> I think in the next version it might be wise to split this and post as two series,
> yes.
Ok.

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 10:03                                       ` Parav Pandit
@ 2023-11-17 11:00                                         ` Michael S. Tsirkin
  2023-11-17 11:05                                           ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 11:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> 
> 
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Friday, November 17, 2023 3:30 PM
> > 
> > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > >>
> > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > >>>> We should expose a limit of the device in the proposed
> > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > >>>> So that future provisioning framework can use it.
> > >>>>
> > >>>> I will cover this in v5 early next week.
> > >>> I do worry about how this can even work though. If you want a
> > >>> generic device you do not get to dictate how much memory VM has.
> > >>>
> > >>> Aren't we talking bit per page? With 1TByte of memory to track ->
> > >>> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > >>>
> > >>> And you happily say "we'll address this in the future" while at the
> > >>> same time fighting tooth and nail against adding single bit status
> > >>> registers because scalability?
> > >>>
> > >>>
> > >>> I have a feeling doing this completely theoretical like this is problematic.
> > >>> Maybe you have it all laid out neatly in your head but I suspect not
> > >>> all of TC can picture it clearly enough based just on spec text.
> > >>>
> > >>> We do sometimes ask for POC implementation in linux / qemu to
> > >>> demonstrate how things work before merging code. We skipped this for
> > >>> admin things so far but I think it's a good idea to start doing it
> > >>> here.
> > >>>
> > >>> What makes me pause a bit before saying please do a PoC is all the
> > >>> opposition that seems to exist to even using admin commands in the
> > >>> 1st place. I think once we finally stop arguing about whether to use
> > >>> admin commands at all then a PoC will be needed before merging.
> > >> We have POR productions that implemented the approach in my series.
> > >> They are multiple generations of productions in market and running in
> > >> customers data centers for years.
> > >>
> > >> Back to 2019 when we start working on vDPA, we have sent some samples
> > >> of production(e.g., Cascade Glacier) and the datasheet, you can find
> > >> live migration facilities there, includes suspend, vq state and other
> > >> features.
> > >>
> > >> And there is an reference in DPDK live migration, I have provided
> > >> this page
> > >> before:
> > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it has been
> > >> working for long long time.
> > >>
> > >> So if we let the facts speak, if we want to see if the proposal is
> > >> proven to work, I would
> > >> say: They are POR for years, customers already deployed them for years.
> > > And I guess what you are trying to say is that this patchset we are
> > > reviewing here should be help to the same standard and there should be
> > > a PoC? Sounds reasonable.
> > Yes and the in-marketing productions are POR, the series just improves the
> > design, for example, our series also use registers to track vq state, but
> > improvements than CG or BSC. So I think they are proven to work.
> 
> If you prefer to go the route of POR and production and proven documents etc, there is ton of it of multiple types of products I can dump here with open-source code and documentation and more.
> Let me know what you would like to see.
> 
> Michael has requested some performance comparisons, not all are ready to share yet.
> Some are present that I will share in coming weeks.
> 
> And all the vdpa dpdk you published does not have basic CVQ support when I last looked at it.
> Do you know when was it added?

It's good enough for a PoC I think, CVQ or not.
The problem with CVQ generally is that vDPA wants to shadow CVQ at all times
because it wants to decode and cache the content. But this problem has nothing
to do with dirty tracking even though it also mentions "shadow":
if the device can report its state then there's no need to shadow CVQ.

> > >
> > >> For dirty page tracking, I see you want both platform IOMMU tracking
> > >> and shadow vqs, I am totally fine with this idea. And I think maybe
> > >> we should merge the basic features first, and dirty page tracking
> > >> should be the second step.
> > >>
> > >> Thanks
> > > Parav wants to add an option of on-device tracking. Which also seems
> > > fine. I think it should be optional though because shadow and IOMMU
> > > options exist.
> > I agree, the vendor can choose to implement their own facility as a backup.
> > >
> 




^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 11:00                                         ` Michael S. Tsirkin
@ 2023-11-17 11:05                                           ` Parav Pandit
  2023-11-17 11:33                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 11:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 4:30 PM
> 
> On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Friday, November 17, 2023 3:30 PM
> > >
> > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > >>
> > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > >>>> We should expose a limit of the device in the proposed
> > > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > >>>> So that future provisioning framework can use it.
> > > >>>>
> > > >>>> I will cover this in v5 early next week.
> > > >>> I do worry about how this can even work though. If you want a
> > > >>> generic device you do not get to dictate how much memory VM has.
> > > >>>
> > > >>> Aren't we talking bit per page? With 1TByte of memory to track
> > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > >>>
> > > >>> And you happily say "we'll address this in the future" while at
> > > >>> the same time fighting tooth and nail against adding single bit
> > > >>> status registers because scalability?
> > > >>>
> > > >>>
> > > >>> I have a feeling doing this completely theoretical like this is problematic.
> > > >>> Maybe you have it all laid out neatly in your head but I suspect
> > > >>> not all of TC can picture it clearly enough based just on spec text.
> > > >>>
> > > >>> We do sometimes ask for POC implementation in linux / qemu to
> > > >>> demonstrate how things work before merging code. We skipped this
> > > >>> for admin things so far but I think it's a good idea to start
> > > >>> doing it here.
> > > >>>
> > > >>> What makes me pause a bit before saying please do a PoC is all
> > > >>> the opposition that seems to exist to even using admin commands
> > > >>> in the 1st place. I think once we finally stop arguing about
> > > >>> whether to use admin commands at all then a PoC will be needed
> before merging.
> > > >> We have POR productions that implemented the approach in my series.
> > > >> They are multiple generations of productions in market and
> > > >> running in customers data centers for years.
> > > >>
> > > >> Back to 2019 when we start working on vDPA, we have sent some
> > > >> samples of production(e.g., Cascade Glacier) and the datasheet,
> > > >> you can find live migration facilities there, includes suspend,
> > > >> vq state and other features.
> > > >>
> > > >> And there is an reference in DPDK live migration, I have provided
> > > >> this page
> > > >> before:
> > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it has been
> > > >> working for long long time.
> > > >>
> > > >> So if we let the facts speak, if we want to see if the proposal
> > > >> is proven to work, I would
> > > >> say: They are POR for years, customers already deployed them for years.
> > > > And I guess what you are trying to say is that this patchset we
> > > > are reviewing here should be help to the same standard and there
> > > > should be a PoC? Sounds reasonable.
> > > Yes and the in-marketing productions are POR, the series just
> > > improves the design, for example, our series also use registers to
> > > track vq state, but improvements than CG or BSC. So I think they are proven
> to work.
> >
> > If you prefer to go the route of POR and production and proven documents
> etc, there is ton of it of multiple types of products I can dump here with open-
> source code and documentation and more.
> > Let me know what you would like to see.
> >
> > Michael has requested some performance comparisons, not all are ready to
> share yet.
> > Some are present that I will share in coming weeks.
> >
> > And all the vdpa dpdk you published does not have basic CVQ support when I
> last looked at it.
> > Do you know when was it added?
> 
> It's good enough for PoC I think, CVQ or not.
> The problem with CVQ generally, is that VDPA wants to shadow CVQ it at all
> times because it wants to decode and cache the content. But this problem has
> nothing to do with dirty tracking even though it also mentions "shadow":
> if device can report it's state then there's no need to shadow CVQ.

As for performance numbers: with pre-copy and the device context from patches 1 to 5 as posted, the VM downtime is reduced by 3.71x with active traffic on 8 RQs at 100 Gbps port speed.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 10:48                                     ` Parav Pandit
@ 2023-11-17 11:19                                       ` Michael S. Tsirkin
  2023-11-17 11:32                                         ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 11:19 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 10:48:36AM +0000, Parav Pandit wrote:
> > > > Actually not, Parav said the device needs to reserve sufficient
> > > > resources in another thread.
> > > The device resource reservation starts only when the device migration starts.
> > > i.e. with WRITE_RECORDS_START command of patch 7 in the series.
> > 
> > And now your precious VM can't migrate at all because -ENOSPC.
> >
> I am not aware of any Linux IOCTL that ensures a guaranteed execution without an error code. :)
> 
> As we talked in other email, a VF can be provisioned too as extension and capability can be exposed.
> This is not going the only error on device migration.

Allocating resources on outgoing migration is a very bad idea.
It is common to migrate precisely because you are out of resources.
Incoming is a different story, and less of a problem.


> > 
> > 
> > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > The data structure is different but I don't see why it is critical.
> > > > > > >
> > > > > > > I agree that I don't see out of buffers notifications too
> > > > > > > which implies device has to maintain something like a bitmap internally.
> > > > > > > Which I guess could be fine but it is not clear to me how
> > > > > > > large that bitmap has to be. How does the device know? Needs to be
> > addressed.
> > > > > >
> > > > > > This is the question I asked Parav in another thread. Using host
> > > > > > memory as a queue with notification (like PML) might be much better.
> > > > >
> > > > > Well if queue is what you want to do you can just do it internally.
> > > >
> > > > Then it's not the proposal here, Parav has explained it in another
> > > > reply, and as explained it lacks a lot of other facilities.
> > > >
> > > PML is yet another option that requires small pci writes.
> > > In the current proposal, there are no small PCI writes.
> > > It is a query interface from the device.
> > >
> > > > > Problem of course is that it might overflow and cause things like
> > > > > packet drops.
> > > >
> > > > Exactly like PML. So sticking to wire speed should not be a general
> > > > goal in the context of migration. It can be done if the speed of the
> > > > migration interface is faster than the virtio device that needs to be migrated.
> > > May not have to be.
> > > Speed of page recording should be fast enough.
> > > It usually improves with subsequent generation.
> > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > > > >
> > > > > > > > 1) For many reasons it can neither see nor log via GPA, so
> > > > > > > > this requires a traversal of the vIOMMU mapping tables by
> > > > > > > > the hypervisor afterwards, it would be expensive and need
> > > > > > > > synchronization with the guest modification of the IO page
> > > > > > > > table which
> > > > looks very hard.
> > > > > > >
> > > > > > > vIOMMU is fast enough to be used on data path but not fast
> > > > > > > enough for dirty tracking?
> > > > > >
> > > > > > We set up SPTEs or using nesting offloading where the PTEs could
> > > > > > be iterated by hardware directly which is fast.
> > > > >
> > > > > There's a way to have hardware find dirty PTEs for you quickly?
> > > >
> > > > Scanning PTEs on the host is faster and more secure than scanning
> > > > guests, that's what I want to say:
> > > >
> > > > 1) the guest page could be swapped out but not the host one.
> > > > 2) no guest triggerable behavior
> > > >
> > >
> > > Device page tracking table to be consulted to flush on mapping change.
> > >
> > > > > I don't know how it's done. Do tell.
> > > > >
> > > > >
> > > > > > This is not the case here where software needs to iterate the IO
> > > > > > page tables in the guest which could be slow.
> > > > > >
> > > > > > > Hard to believe.  If true and you want to speed up vIOMMU then
> > > > > > > you implement an efficient datastructure for that.
> > > > > >
> > > > > > Besides the issue of performance, it's also racy, assuming we
> > > > > > are logging
> > > > IOVA.
> > > > > >
> > > > > > 0) device log IOVA
> > > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > > 2) guest map IOVA to a new GPA
> > > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > > >
> > > > > > Then we lost the old GPA.
> > > > >
> > > > > Interesting and a good point.
> > > >
> > > > Note that PML logs at GPA as it works at L1 of EPT.
> > > >
> > > > > And by the way e.g. vhost has the same issue.  You need to flush
> > > > > dirty tracking info when changing the mappings somehow.
> > > >
> > > > It's not,
> > > >
> > > > 1) memory translation is done by vhost
> > > > 2) vhost knows GPA and it doesn't log via IOVA.
> > > >
> > > > See this for example, and DPDK has similar fixes.
> > > >
> > > > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > > > Author: Jason Wang <jasowang@redhat.com>
> > > > Date:   Wed Jan 16 16:54:42 2019 +0800
> > > >
> > > >     vhost: log dirty page correctly
> > > >
> > > >     Vhost dirty page logging API is designed to sync through GPA. But we
> > > >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > > >     lead to missing data after migration.
> > > >
> > > >     To solve this issue, when logging with device IOTLB enabled, we will:
> > > >
> > > >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > > >        get HVA, for writable descriptor, get HVA through iovec. For used
> > > >        ring update, translate its GIOVA to HVA
> > > >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> > > >        through GPA. Pay attention this reverse mapping is not guaranteed
> > > >        to be unique, so we should log each possible GPA in this case.
> > > >
> > > >     This fix the failure of scp to guest during migration. In -next, we
> > > >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > > >
> > > >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > > >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > > >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> > > >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > >
> > > > All of the above is not what virtio did right now.
> > > >
> > > > > Parav what's the plan for this? Should be addressed in the spec too.
> > > > >
> > > >
> > > > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> > > >
> > >
> > > The query interface in this proposal works on the granular boundary to read
> > and clear.
> > > This will ensure that mapping is consistent.
> > 
> > By itself it does not, you have to actually keep querying until you flush all dirty
> > info and do it each time there's an invalidation in the IOMMU.
> >
> Only during device migration time.
> It only applied on those specific cases when unmapping and migration both in progress at same time.
> But yes, it can slow down unmapping.
>  
> > 
> > > > >
> > > > >
> > > > > > >
> > > > > > > > 2) There are a lot of special or reserved IOVA ranges (for
> > > > > > > > example the interrupt areas in x86) that need special care
> > > > > > > > which is architectural and where it is beyond the scope or
> > > > > > > > knowledge of the virtio device but the platform IOMMU.
> > > > > > > > Things would be more complicated when SVA is enabled.
> > > > > > >
> > > > > > > SVA being what here?
> > > > > >
> > > > > > For example, IOMMU may treat interrupt ranges differently
> > > > > > depending on whether SVA is enabled or not. It's very hard and
> > > > > > unnecessary to teach devices about this.
> > > > >
> > > > > Oh, shared virtual memory. So what you are saying here? virtio
> > > > > does not care, it just uses some addresses and if you want it to
> > > > > it can record writes somewhere.
> > > >
> > > > One example, PCI allows devices to send translated requests, how can
> > > > a hypervisor know it's a PA or IOVA in this case? We probably need a
> > > > new bit. But it's not the only thing we need to deal with.
> > > >
> > > > By definition, interrupt ranges and other reserved ranges should not
> > > > belong to dirty pages. And the logging should be done before the DMA
> > > > where there's no way for the device to know whether or not an IOVA
> > > > is valid or not. It would be more safe to just not report them from
> > > > the source instead of leaving it to the hypervisor to deal with but
> > > > this seems impossible at the device level. Otherwise the hypervisor
> > > > driver needs to communicate with the (v)IOMMU to be reached with the
> > > > interrupt(MSI) area, RMRR area etc in order to do the correct things
> > > > or it might have security implications. And those areas don't make
> > > > sense at L1 when vSVA is enabled. What's more, when vIOMMU could be
> > > > fully offloaded, there's no easy way to fetch that information.
> > > >
> > > There cannot be logging before the DMA.
> > > Only requirement is before the mapping changes, the dirty page tracking to be
> > synced.
> > >
> > > In most common cases where the perf is critical, such mapping wont change
> > so often dynamically anyway.
> > >
> > > > Again, it's hard to bypass or even duplicate the functionality of
> > > > the platform or we need to step into every single detail of a
> > > > specific transport, architecture or IOMMU to figure out whether or
> > > > not logging at virtio is correct which is awkward and unrealistic.
> > > > This proposal suffers from an exact similar issue when inventing
> > > > things like freeze/stop where I've pointed out other branches of issues as
> > well.
> > > >
> > > It is incorrect attribution that platform is duplicated here.
> > > It feeds the data to the platform as needed without replicating.
> > >
> > > I do agree that there is overlap of IOMMU tracking the dirty and storing it in
> > the per PTE vs device supplying its dirty track via its own interface.
> > > Both are consolidated at hypervisor level.
> > >
> > > > >
> > > > > > >
> > > > > > > > And there could be other architecte specific knowledge (e.g
> > > > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal
> > > > > > > > with those cases.
> > > > > > >
> > > > > > > Good point about page size actually - using 4k unconditionally
> > > > > > > is a waste of resources.
> > > > > >
> > > > > > Actually, they are more than just PAGE_SIZE, for example, PASID and
> > others.
> > > > >
> > > > > what does pasid have to do with it? anyway, just give driver
> > > > > control over page size.
> > > >
> > > > For example, two virtqueues have two PASIDs assigned. How can a
> > > > hypervisor know which specific IOVA belongs to which IOVA? For
> > > > platform IOMMU, they are handy as it talks to the transport. But I
> > > > don't think we need to duplicate every transport specific address space
> > feature in core virtio layer:
> > > >
> > > PASID to vq assignment won't be duplicated.
> > > It is configured fully by the guest without consulting hypervisor at the device
> > level.
> > > Guest IOMMU would consult hypervisor to setup any PASID mapping as part
> > of any mapping method.
> > >
> > > > 1) translated/untranslated request
> > > > 2) request w/ and w/o PASID
> > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > We wouldn't need to care about all of them if it is done at
> > > > > > > > platform IOMMU level.
> > > > > > >
> > > > > > > If someone logs at IOMMU level then nothing needs to be done
> > > > > > > in the spec at all. This is about capability at the device level.
> > > > > >
> > > > > > True, but my question is where or not it can be done at the
> > > > > > device level
> > > > easily.
> > > > >
> > > > > there's no "easily" about live migration ever.
> > > >
> > > > I think I've stated sufficient issues to demonstrate how hard virtio wants to
> > do it.
> > > > And I've given the link that it is possible to do that in IOMMU
> > > > without those issues. So in this context doing it in virtio is much harder.
> > > >
> > > > > For example on-device iommus are a thing.
> > > >
> > > > I'm not sure that's the way to go considering the platform IOMMU
> > > > evolves very quickly.
> > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > > what Lingshan
> > > > > > > > > proposed is analogous to bit per page - problem
> > > > > > > > > unfortunately is you can't easily set a bit by DMA.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm not saying bit/bytemap is the best, but it has been used
> > > > > > > > by real hardware. And we have many other options.
> > > > > > > >
> > > > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > > >
> > > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > > Because users needs to use it now.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > If not, we are better off to offer this, and
> > > > > > > > > > > > > when/if platform support is, sure,
> > > > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 4) if the platform support is missing, we can
> > > > > > > > > > > > > > use software or leverage transport for
> > > > > > > > > > > > > > assistance like PRI
> > > > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > > > than page fault rate
> > > > > > > > > > > > done by the cpu.
> > > > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > > >
> > > > > > > > > > > > If you stick to the wire speed during migration, it can
> > converge.
> > > > > > > > > > > Do you have perf data for this?
> > > > > > > > > >
> > > > > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > > > > small program that dirty every page by a NIC.
> > > > > > > > > >
> > > > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > > > >
> > > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > > >
> > > > > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > > > > NIC), we can't satisfy the requirement of the downtime.
> > > > > > > > > > Or if you see the converge, you might get help from the
> > > > > > > > > > auto converge support by the hypervisors like KVM where
> > > > > > > > > > it tries to throttle the VCPU then you can't reach the wire speed.
> > > > > > > > >
> > > > > > > > > Will only work for some device types.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, that's the point. Parav said he doesn't see the issue,
> > > > > > > > it's probably because he is testing a virtio-net and so the
> > > > > > > > vCPU is automatically throttled. It doesn't mean it can work
> > > > > > > > for other virito devices.
> > > > > > >
> > > > > > > Only for TX, and I'm pretty sure they had the foresight to
> > > > > > > test RX not just TX but let's confirm. Parav did you test both directions?
> > > > > >
> > > > > > RX speed somehow depends on the speed of refill, so throttling
> > > > > > helps more or less.
> > > > >
> > > > > It doesn't depend on speed of refill you just underrun and drop
> > > > > packets. then your nice 10usec latency becomes more like 10sec.
> > > >
> > > > I miss your point here. If the driver can't achieve wire speed
> > > > without dirty page tracking, it can neither when dirty page tracking is
> > enabled.
> > > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > > So it is unusable.
> > > > > > > > > > > >
> > > > > > > > > > > > It's not about mandating, it's about doing things in
> > > > > > > > > > > > the correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > > > You should try.
> > > > > > > > > >
> > > > > > > > > > Not my duty, I just want to make sure things are done in
> > > > > > > > > > the correct layer, and once it needs to be done in the
> > > > > > > > > > virtio, there's nothing obviously wrong.
> > > > > > > > >
> > > > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > > > >
> > > > > > > > I don't think it's vague, I have explained, if something in
> > > > > > > > the virito slows down the PRI, we can try to fix them.
> > > > > > >
> > > > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > > > >
> > > > > > So it's the fault of PRI not virito, but it doesn't mean we need
> > > > > > to do it in virtio.
> > > > >
> > > > > I keep saying with this approach we would just say "e1000
> > > > > emulation is slow and encumbered this is the fault of e1000" and
> > > > > never get virtio at all.  Assigning blame only gets you so far.
> > > >
> > > > I think we are discussing different things. My point is virtio needs
> > > > to leverage the functionality provided by transport or platform
> > > > (especially considering they evolve faster than virtio). It seems to
> > > > me it's hard even to duplicate some basic function of platform IOMMU in
> > virtio.
> > > >
> > > Not duplicated. Feeding into the platform.
> > 
> > I mean IOMMU still sets the dirty bit, too. How is that not a duplication?
> >
> Only if the IOMMU is enabled for it.
> For example AMD has DTE HAD bit to enable dirty page tracking in IOMMU.
> 
> So if the platform does not enable it, it can be enabled on the device, and vice versa.

So again, if your motivation is on-device IOMMU then say so, and in this
case I don't see the point of only adding write tracking without
adding the actual device IOMMU interface.
And maybe that is the answer to resource management questions:
there's going to be an IOMMU data structure on the device
and it's just an extra bit in the PTE there.
Makes sense but let's see it all together then.
Because separate from on-device IOMMU it looks crazily expensive
and just weird.


> > 
> > > > >
> > > > > > >
> > > > > > > > Missing functions in
> > > > > > > > platform or transport is not a good excuse to try to
> > > > > > > > workaround it in the virtio. It's a layer violation and we
> > > > > > > > never had any feature like this in the past.
> > > > > > >
> > > > > > > Yes missing functionality in the platform is exactly why
> > > > > > > virtio was born in the first place.
> > > > > >
> > > > > > Well the platform can't do device specific logic. But that's not
> > > > > > the case of dirty page tracking which is device logic agnostic.
> > > > >
> > > > > Not true platforms have things like NICs on board and have for
> > > > > many years. It's about performance really.
> > > >
> > > > I've stated sufficient issues above. And one more obvious issue for
> > > > device initiated page logging is that it needs a lot of extra or
> > > > unnecessary PCI transactions which will throttle the performance of
> > > > the whole system (and lead to other issues like QOS). So I can't believe it has
> > good performance overall.
> > > > Logging via IOMMU or using shadow virtqueue doesn't need any extra
> > > > PCI transactions at least.
> > > >
> > > In the current proposal, it does not require PCI transactions, as there is only a
> > hypervisor-initiated query interface.
> > > It is a trade off of using svq + pasid vs using something from the device.
> > >
> > > Again, both have different use cases and value. One uses the CPU and one
> > > uses the device.
> > > Depending on how much power one wants to spend where.
> > 
> > Also how much effort we want to spend on this virtio specific thing.
> > There needs to be a *reason* to do things in virtio as opposed to using platform
> > capabilities, this is exactly the same thing I told Lingshan wrt using SUSPEND for
> > power management as opposed to using PCI PM - relying on platform when we
> > can is right there in the mission statement.
> > For some reason I assumed you guys have done a PoC and that's the
> > motivation but if it's a "just in case" feature then I'd suggest we focus on
> > merging patches 1-5 first.
> >
> It is not a just-in-case feature.
> We learnt that not all CPUs have it.

Have dirty tracking? Well shadow is portable.

> There are ongoing efforts on the PoC.
> We will have the results in some time.
> 
> We have a similar interface on at least two devices already, integrated in the Linux stack; one is upstream, the other is in progress.

Aha.  I hear IOMMUFD is working on integrating access to dirty bit.
Maybe compare performance to that?
It does not have to be exactly virtio I think for PoC.

> virtio is also in discussion here.
> 
> Sure, it is proposed as optional. We can focus on 1-5 first.
> I will split the series once I have comments.
> 
> There is also extension after 1-5 for net device context as well.
> 
>  
> > 
> > > > > So I'd like Parav to publish some
> > > > > experiment results and/or some estimates.
> > > > >
> > > >
> > > > That's fine, but the above equation (used by Qemu) is sufficient to
> > > > demonstrate how hard to stick wire speed in the case.
> > > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > In the current state, it is mandating.
> > > > > > > > > > > And if you think PRI is the only way,
> > > > > > > > > >
> > > > > > > > > > I don't, it's just an example where virtio can leverage
> > > > > > > > > > from either transport or platform. Or if it's the fault
> > > > > > > > > > in virtio that slows down the PRI, then it is something we can do.
> > > > > > > > > >
> > > > > > > > > > >  than you should propose that in the dirty page
> > > > > > > > > > > tracking series that
> > > > you listed above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > > > >
> > > > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > > > >
> > > > > > > > > If someone says they tried and platform's migration
> > > > > > > > > support does not work for them and they want to build a
> > > > > > > > > solution in virtio then what exactly is the objection?
> > > > > > > >
> > > > > > > > The discussion is to make sure whether virtio can do this
> > > > > > > > easily and correctly, then we can have a conclusion. I've
> > > > > > > > stated some issues above, and I've asked other questions
> > > > > > > > related to them which are still not answered.
> > > > > > > >
> > > > > > > > I think we had a very hard time in bypassing IOMMU in the
> > > > > > > > past that we don't want to repeat.
> > > > > > > >
> > > > > > > > We've gone through several methods of logging dirty pages in
> > > > > > > > the past (each with pros/cons), but this proposal never
> > > > > > > > explains why it chooses one of them but not others. Spec
> > > > > > > > needs to find the best path instead of just a possible path
> > > > > > > > without any rationale about
> > > > why.
> > > > > > >
> > > > > > > Adding more rationale isn't a bad thing.
> > > > > > > In particular if platform supplies dirty tracking then how
> > > > > > > does driver decide which to use platform or device capability?
> > > > > > > A bit of discussion around this is a good idea.
> > > > > > >
> > > > > > >
> > > > > > > > > virtio is here in the
> > > > > > > > > first place because emulating devices didn't work well.
> > > > > > > >
> > > > > > > > I don't understand here. We have supported emulated devices for
> > years.
> > > > > > > > I'm pretty sure a lot of issues could be uncovered if this
> > > > > > > > proposal can be prototyped with an emulated device first.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > >
> > > > > > > virtio was originally PV as opposed to emulation. That there's
> > > > > > > now hardware virtio and you call software implementation "an
> > > > > > > emulation" is very meta.
> > > > > >
> > > > > > Yes but I don't see how it relates to dirty page tracking. When
> > > > > > we find a way it should work for both software and hardware devices.
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > It has to work well on a variety of existing platforms. If it does
> > > > > then sure, why would we roll our own.
> > > >
> > > > If virtio can do that in an efficient way without any issues, I agree.
> > > > But it seems not.
> > > >
> > > > Thanks
> > 
> > 
> 




^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 11:19                                       ` Michael S. Tsirkin
@ 2023-11-17 11:32                                         ` Parav Pandit
  2023-11-17 11:49                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 11:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 4:50 PM
> 
> On Fri, Nov 17, 2023 at 10:48:36AM +0000, Parav Pandit wrote:
> > > > > Actually not, Parav said the device needs to reserve sufficient
> > > > > resources in another thread.
> > > > The device resource reservation starts only when the device migration
> starts.
> > > > i.e. with WRITE_RECORDS_START command of patch 7 in the series.
> > >
> > > And now your precious VM can't migrate at all because -ENOSPC.
> > >
> > I am not aware of any Linux IOCTL that ensures a guaranteed execution
> > without an error code. :)
> >
> > As we discussed in another email, a VF can be provisioned too as an extension,
> > and the capability can be exposed.
> > This is not going to be the only error on device migration.
> 
> Allocating resources on outgoing migration is a very bad idea.
> It is common to migrate precisely because you are out of resources.
> Incoming is a different story, less of a problem.
>
The resource allocated may not be on the same system.
Also, the resource is allocated while the VM is running, so I don't see a problem.

Additionally, this is not what the Linux kernel maintainers of the iommu subsystem told us either.
Let me know if you check with Alex W and Jason, who built this interface.
 
> 
> > >
> > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > > The data structure is different but I don't see why it is critical.
> > > > > > > >
> > > > > > > > I agree that I don't see out of buffers notifications too
> > > > > > > > which implies device has to maintain something like a bitmap
> internally.
> > > > > > > > Which I guess could be fine but it is not clear to me how
> > > > > > > > large that bitmap has to be. How does the device know?
> > > > > > > > Needs to be
> > > addressed.
> > > > > > >
> > > > > > > This is the question I asked Parav in another thread. Using
> > > > > > > host memory as a queue with notification (like PML) might be much
> better.
> > > > > >
> > > > > > Well if queue is what you want to do you can just do it internally.
> > > > >
> > > > > Then it's not the proposal here, Parav has explained it in
> > > > > another reply, and as explained it lacks a lot of other facilities.
> > > > >
> > > > PML is yet another option that requires small PCI writes.
> > > > In the current proposal, there are no small PCI writes.
> > > > It is a query interface from the device.
> > > >
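
As a rough sketch of what such a hypervisor-initiated flow could look like
(only the WRITE_RECORD_CAP_QUERY / WRITE_RECORDS_START command names come
from this series; all other helpers and structures below are hypothetical):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

struct write_record { uint64_t iova; uint64_t len; };

/* Hypothetical wrappers around the admin commands; everything except the
 * WRITE_RECORD_CAP_QUERY / WRITE_RECORDS_START names is made up. */
static uint64_t admin_write_record_cap_query(void) { return 1ULL << 40; }
static void admin_write_records_start(uint64_t iova, uint64_t len) { (void)iova; (void)len; }
static void admin_write_records_stop(void) { }
static int admin_write_records_read_and_clear(struct write_record *rec, int max)
{ (void)rec; (void)max; return 0; }
static bool precopy_in_progress(void) { return false; }
static void mark_bitmap_dirty(uint64_t iova, uint64_t len)
{ printf("dirty 0x%llx + 0x%llx\n", (unsigned long long)iova, (unsigned long long)len); }

static void track_writes(uint64_t iova, uint64_t len)
{
	if (len > admin_write_record_cap_query())
		return;                      /* range exceeds what the device can track */

	admin_write_records_start(iova, len);
	while (precopy_in_progress()) {
		struct write_record rec[64];
		int n = admin_write_records_read_and_clear(rec, 64);

		for (int i = 0; i < n; i++)
			mark_bitmap_dirty(rec[i].iova, rec[i].len);
	}
	admin_write_records_stop();
}

int main(void)
{
	track_writes(0, 1ULL << 30);
	return 0;
}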
> > > > > > Problem of course is that it might overflow and cause things
> > > > > > like packet drops.
> > > > >
> > > > > Exactly like PML. So sticking to wire speed should not be a
> > > > > general goal in the context of migration. It can be done if the
> > > > > speed of the migration interface is faster than the virtio device that
> needs to be migrated.
> > > > May not have to be.
> > > > Speed of page recording should be fast enough.
> > > > It usually improves with subsequent generation.
> > > > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Even if we manage to do that, it doesn't mean we won't have
> issues.
> > > > > > > > >
> > > > > > > > > 1) For many reasons it can neither see nor log via GPA,
> > > > > > > > > so this requires a traversal of the vIOMMU mapping
> > > > > > > > > tables by the hypervisor afterwards, it would be
> > > > > > > > > expensive and need synchronization with the guest
> > > > > > > > > modification of the IO page table which
> > > > > looks very hard.
> > > > > > > >
> > > > > > > > vIOMMU is fast enough to be used on data path but not fast
> > > > > > > > enough for dirty tracking?
> > > > > > >
> > > > > > > We set up SPTEs or using nesting offloading where the PTEs
> > > > > > > could be iterated by hardware directly which is fast.
> > > > > >
> > > > > > There's a way to have hardware find dirty PTEs for you quickly?
> > > > >
> > > > > Scanning PTEs on the host is faster and more secure than
> > > > > scanning guests, that's what I want to say:
> > > > >
> > > > > 1) the guest page could be swapped out but not the host one.
> > > > > 2) no guest triggerable behavior
> > > > >
> > > >
> > > > Device page tracking table to be consulted to flush on mapping change.
> > > >
> > > > > > I don't know how it's done. Do tell.
> > > > > >
> > > > > >
> > > > > > > This is not the case here where software needs to iterate
> > > > > > > the IO page tables in the guest which could be slow.
> > > > > > >
> > > > > > > > Hard to believe.  If true and you want to speed up vIOMMU
> > > > > > > > then you implement an efficient datastructure for that.
> > > > > > >
> > > > > > > Besides the issue of performance, it's also racy, assuming
> > > > > > > we are logging
> > > > > IOVA.
> > > > > > >
> > > > > > > 0) device log IOVA
> > > > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > > > 2) guest map IOVA to a new GPA
> > > > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > > > >
> > > > > > > Then we lost the old GPA.
> > > > > >
> > > > > > Interesting and a good point.
> > > > >
> > > > > Note that PML logs at GPA as it works at L1 of EPT.
> > > > >
> > > > > > And by the way e.g. vhost has the same issue.  You need to
> > > > > > flush dirty tracking info when changing the mappings somehow.
> > > > >
> > > > > It's not,
> > > > >
> > > > > 1) memory translation is done by vhost
> > > > > 2) vhost knows GPA and it doesn't log via IOVA.
> > > > >
> > > > > See this for example, and DPDK has similar fixes.
> > > > >
> > > > > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > > > > Author: Jason Wang <jasowang@redhat.com>
> > > > > Date:   Wed Jan 16 16:54:42 2019 +0800
> > > > >
> > > > >     vhost: log dirty page correctly
> > > > >
> > > > >     Vhost dirty page logging API is designed to sync through GPA. But we
> > > > >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > > > >     lead to missing data after migration.
> > > > >
> > > > >     To solve this issue, when logging with device IOTLB enabled, we will:
> > > > >
> > > > >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > > > >        get HVA, for writable descriptor, get HVA through iovec. For used
> > > > >        ring update, translate its GIOVA to HVA
> > > > >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> > > > >        through GPA. Pay attention this reverse mapping is not guaranteed
> > > > >        to be unique, so we should log each possible GPA in this case.
> > > > >
> > > > >     This fix the failure of scp to guest during migration. In -next, we
> > > > >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > > > >
> > > > >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > > > >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > > > >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> > > > >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > > > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > > >
> > > > > All of the above is not what virtio did right now.
> > > > >
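
A minimal sketch of the reverse-mapping idea from the commit message above:
given an HVA range the device wrote, log every GPA that can map to it
(region layout and helper names are hypothetical, not the actual vhost code):

#include <stdio.h>

struct mem_region {
	unsigned long long gpa;
	unsigned long long hva;
	unsigned long long size;
};

static void log_gpa(unsigned long long gpa)
{
	printf("dirty gpa 0x%llx\n", gpa);  /* stand-in for the dirty bitmap update */
}

static void log_write_hva(const struct mem_region *map, int n,
			  unsigned long long hva, unsigned long long len)
{
	for (int i = 0; i < n; i++) {
		unsigned long long start;

		if (hva + len <= map[i].hva || hva >= map[i].hva + map[i].size)
			continue;
		/* The GPA->HVA mapping is not guaranteed to be unique, so
		 * every matching region gets logged. */
		start = hva > map[i].hva ? hva : map[i].hva;
		log_gpa(map[i].gpa + (start - map[i].hva));
	}
}

int main(void)
{
	struct mem_region map[] = { { 0x0, 0x7f0000000000ULL, 1ULL << 30 } };

	log_write_hva(map, 1, 0x7f0000001000ULL, 4096);
	return 0;
}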
> > > > > > Parav what's the plan for this? Should be addressed in the spec too.
> > > > > >
> > > > >
> > > > > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> > > > >
> > > >
> > > > The query interface in this proposal works on the granular
> > > > boundary to read
> > > and clear.
> > > > This will ensure that mapping is consistent.
> > >
> > > By itself it does not, you have to actually keep querying until you
> > > flush all dirty info and do it each time there's an invalidation in the IOMMU.
> > >
> > Only during device migration.
> > It only applies in those specific cases when unmapping and migration are both
> > in progress at the same time.
> > But yes, it can slow down unmapping.
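
A rough ordering sketch of that point (hypothetical helper names; the idea is
only that the device's write records for a range are drained and applied
through the old IOVA-to-GPA mapping before the unmap takes effect):

#include <stdint.h>
#include <stdbool.h>

/* All helpers below are hypothetical stand-ins, not commands from this series. */
static bool migration_in_progress(void) { return true; }
static void drain_device_write_records(uint64_t iova, uint64_t len) { (void)iova; (void)len; }
static void mark_dirty_through_old_mapping(uint64_t iova, uint64_t len) { (void)iova; (void)len; }
static void platform_iommu_unmap(uint64_t iova, uint64_t len) { (void)iova; (void)len; }

static void viommu_unmap_range(uint64_t iova, uint64_t len)
{
	if (migration_in_progress()) {
		/* Read-and-clear the device's write records for [iova, iova + len). */
		drain_device_write_records(iova, len);
		/* Log the affected GPAs while the old IOVA->GPA mapping is still known. */
		mark_dirty_through_old_mapping(iova, len);
	}
	platform_iommu_unmap(iova, len);
}

int main(void)
{
	viommu_unmap_range(0x100000, 0x1000);
	return 0;
}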
> >
> > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > 2) There are a lot of special or reserved IOVA ranges
> > > > > > > > > (for example the interrupt areas in x86) that need
> > > > > > > > > special care which is architectural and where it is
> > > > > > > > > beyond the scope or knowledge of the virtio device but the
> platform IOMMU.
> > > > > > > > > Things would be more complicated when SVA is enabled.
> > > > > > > >
> > > > > > > > SVA being what here?
> > > > > > >
> > > > > > > For example, IOMMU may treat interrupt ranges differently
> > > > > > > depending on whether SVA is enabled or not. It's very hard
> > > > > > > and unnecessary to teach devices about this.
> > > > > >
> > > > > > Oh, shared virtual memory. So what you are saying here? virtio
> > > > > > does not care, it just uses some addresses and if you want it
> > > > > > to it can record writes somewhere.
> > > > >
> > > > > One example, PCI allows devices to send translated requests, how
> > > > > can a hypervisor know it's a PA or IOVA in this case? We
> > > > > probably need a new bit. But it's not the only thing we need to deal with.
> > > > >
> > > > > By definition, interrupt ranges and other reserved ranges should
> > > > > not belong to dirty pages. And the logging should be done before
> > > > > the DMA where there's no way for the device to know whether or
> > > > > not an IOVA is valid or not. It would be more safe to just not
> > > > > report them from the source instead of leaving it to the
> > > > > hypervisor to deal with but this seems impossible at the device
> > > > > level. Otherwise the hypervisor driver needs to communicate with
> > > > > the (v)IOMMU to be reached with the
> > > > > interrupt(MSI) area, RMRR area etc in order to do the correct
> > > > > things or it might have security implications. And those areas
> > > > > don't make sense at L1 when vSVA is enabled. What's more, when
> > > > > vIOMMU could be fully offloaded, there's no easy way to fetch that
> information.
> > > > >
> > > > There cannot be logging before the DMA.
> > > > The only requirement is that before the mapping changes, the dirty page
> > > > tracking is synced.
> > > >
> > > > In the most common cases where performance is critical, such mappings
> > > > won't change dynamically that often anyway.
> > > >
> > > > > Again, it's hard to bypass or even duplicate the functionality
> > > > > of the platform or we need to step into every single detail of a
> > > > > specific transport, architecture or IOMMU to figure out whether
> > > > > or not logging at virtio is correct which is awkward and unrealistic.
> > > > > This proposal suffers from an exact similar issue when inventing
> > > > > things like freeze/stop where I've pointed out other branches of
> > > > > issues as
> > > well.
> > > > >
> > > > It is incorrect attribution that platform is duplicated here.
> > > > It feeds the data to the platform as needed without replicating.
> > > >
> > > > I do agree that there is overlap of IOMMU tracking the dirty and
> > > > storing it in
> > > the per PTE vs device supplying its dirty track via its own interface.
> > > > Both are consolidated at hypervisor level.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > And there could be other architecte specific knowledge
> > > > > > > > > (e.g
> > > > > > > > > PAGE_SIZE) that might be needed. There's no easy way to
> > > > > > > > > deal with those cases.
> > > > > > > >
> > > > > > > > Good point about page size actually - using 4k
> > > > > > > > unconditionally is a waste of resources.
> > > > > > >
> > > > > > > Actually, they are more than just PAGE_SIZE, for example,
> > > > > > > PASID and
> > > others.
> > > > > >
> > > > > > what does pasid have to do with it? anyway, just give driver
> > > > > > control over page size.
> > > > >
> > > > > For example, two virtqueues have two PASIDs assigned. How can a
> > > > > hypervisor know which specific IOVA belongs to which IOVA? For
> > > > > platform IOMMU, they are handy as it talks to the transport. But
> > > > > I don't think we need to duplicate every transport specific
> > > > > address space
> > > feature in core virtio layer:
> > > > >
> > > > PASID to vq assignment won't be duplicated.
> > > > It is configured fully by the guest without consulting hypervisor
> > > > at the device
> > > level.
> > > > Guest IOMMU would consult hypervisor to setup any PASID mapping as
> > > > part
> > > of any mapping method.
> > > >
> > > > > 1) translated/untranslated request
> > > > > 2) request w/ and w/o PASID
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > We wouldn't need to care about all of them if it is done
> > > > > > > > > at platform IOMMU level.
> > > > > > > >
> > > > > > > > If someone logs at IOMMU level then nothing needs to be
> > > > > > > > done in the spec at all. This is about capability at the device level.
> > > > > > >
> > > > > > > True, but my question is where or not it can be done at the
> > > > > > > device level
> > > > > easily.
> > > > > >
> > > > > > there's no "easily" about live migration ever.
> > > > >
> > > > > I think I've stated sufficient issues to demonstrate how hard
> > > > > virtio wants to
> > > do it.
> > > > > And I've given the link that it is possible to do that in IOMMU
> > > > > without those issues. So in this context doing it in virtio is much harder.
> > > > >
> > > > > > For example on-device iommus are a thing.
> > > > >
> > > > > I'm not sure that's the way to go considering the platform IOMMU
> > > > > evolves very quickly.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > > what Lingshan
> > > > > > > > > > proposed is analogous to bit per page - problem
> > > > > > > > > > unfortunately is you can't easily set a bit by DMA.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm not saying bit/bytemap is the best, but it has been
> > > > > > > > > used by real hardware. And we have many other options.
> > > > > > > > >
> > > > > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > > > Because users needs to use it now.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If not, we are better off to offer this, and
> > > > > > > > > > > > > > when/if platform support is, sure,
> > > > > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4) if the platform support is missing, we
> > > > > > > > > > > > > > > can use software or leverage transport for
> > > > > > > > > > > > > > > assistance like PRI
> > > > > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > > > > Our experiment shows PRI performance is 21x
> > > > > > > > > > > > > > slower than page fault rate
> > > > > > > > > > > > > done by the cpu.
> > > > > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If you stick to the wire speed during migration,
> > > > > > > > > > > > > it can
> > > converge.
> > > > > > > > > > > > Do you have perf data for this?
> > > > > > > > > > >
> > > > > > > > > > > No, but it's not hard to imagine the worst case.
> > > > > > > > > > > Wrote a small program that dirty every page by a NIC.
> > > > > > > > > > >
> > > > > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > > > > >
> > > > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > > > >
> > > > > > > > > > > So if we get very high dirty rates (e.g by a high
> > > > > > > > > > > speed NIC), we can't satisfy the requirement of the downtime.
> > > > > > > > > > > Or if you see the converge, you might get help from
> > > > > > > > > > > the auto converge support by the hypervisors like
> > > > > > > > > > > KVM where it tries to throttle the VCPU then you can't reach
> the wire speed.
> > > > > > > > > >
> > > > > > > > > > Will only work for some device types.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes, that's the point. Parav said he doesn't see the
> > > > > > > > > issue, it's probably because he is testing a virtio-net
> > > > > > > > > and so the vCPU is automatically throttled. It doesn't
> > > > > > > > > mean it can work for other virito devices.
> > > > > > > >
> > > > > > > > Only for TX, and I'm pretty sure they had the foresight to
> > > > > > > > test RX not just TX but let's confirm. Parav did you test both
> directions?
> > > > > > >
> > > > > > > RX speed somehow depends on the speed of refill, so
> > > > > > > throttling helps more or less.
> > > > > >
> > > > > > It doesn't depend on speed of refill you just underrun and
> > > > > > drop packets. then your nice 10usec latency becomes more like 10sec.
> > > > >
> > > > > I miss your point here. If the driver can't achieve wire speed
> > > > > without dirty page tracking, it can neither when dirty page
> > > > > tracking is
> > > enabled.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > > > So it is unusable.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's not about mandating, it's about doing
> > > > > > > > > > > > > things in the correct layer. If PRI is slow, PCI can evolve for
> sure.
> > > > > > > > > > > > You should try.
> > > > > > > > > > >
> > > > > > > > > > > Not my duty, I just want to make sure things are
> > > > > > > > > > > done in the correct layer, and once it needs to be
> > > > > > > > > > > done in the virtio, there's nothing obviously wrong.
> > > > > > > > > >
> > > > > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > > > > >
> > > > > > > > > I don't think it's vague, I have explained, if something
> > > > > > > > > in the virito slows down the PRI, we can try to fix them.
> > > > > > > >
> > > > > > > > I don't believe you are going to make PRI fast. No one managed so
> far.
> > > > > > >
> > > > > > > So it's the fault of PRI not virito, but it doesn't mean we
> > > > > > > need to do it in virtio.
> > > > > >
> > > > > > I keep saying with this approach we would just say "e1000
> > > > > > emulation is slow and encumbered this is the fault of e1000"
> > > > > > and never get virtio at all.  Assigning blame only gets you so far.
> > > > >
> > > > > I think we are discussing different things. My point is virtio
> > > > > needs to leverage the functionality provided by transport or
> > > > > platform (especially considering they evolve faster than
> > > > > virtio). It seems to me it's hard even to duplicate some basic
> > > > > function of platform IOMMU in
> > > virtio.
> > > > >
> > > > Not duplicated. Feeding into the platform.
> > >
> > > I mean IOMMU still sets the dirty bit, too. How is that not a duplication?
> > >
> > Only if the IOMMU is enabled for it.
> > For example AMD has DTE HAD bit to enable dirty page tracking in IOMMU.
> >
> > So if the platform does not enable it, it can be enabled on the device, and vice versa.
> 
> So again, if your motivation is on-device IOMMU then say so, and in this case I
> don't see the point of only adding write tracking without adding the actual
> device IOMMU interface.
It is not a device IOMMU, as it does not do all the work of the platform IOMMU.

> And maybe that is the answer to resource management questions:
> there's going to be an IOMMU data structure on the device and it's just an extra
> bit in the PTE there.
> Makes sense but let's see it all together then.
> Because separate from on-device IOMMU it looks crazily expensive and just
> weird.
>
I would agree that there is an expense there, but it is worth it for those CPUs which cannot track it.
 
> 
> > >
> > > > > >
> > > > > > > >
> > > > > > > > > Missing functions in
> > > > > > > > > platform or transport is not a good excuse to try to
> > > > > > > > > workaround it in the virtio. It's a layer violation and
> > > > > > > > > we never had any feature like this in the past.
> > > > > > > >
> > > > > > > > Yes missing functionality in the platform is exactly why
> > > > > > > > virtio was born in the first place.
> > > > > > >
> > > > > > > Well the platform can't do device specific logic. But that's
> > > > > > > not the case of dirty page tracking which is device logic agnostic.
> > > > > >
> > > > > > Not true platforms have things like NICs on board and have for
> > > > > > many years. It's about performance really.
> > > > >
> > > > > I've stated sufficient issues above. And one more obvious issue
> > > > > for device initiated page logging is that it needs a lot of
> > > > > extra or unnecessary PCI transactions which will throttle the
> > > > > performance of the whole system (and lead to other issues like
> > > > > QOS). So I can't believe it has
> > > good performance overall.
> > > > > Logging via IOMMU or using shadow virtqueue doesn't need any
> > > > > extra PCI transactions at least.
> > > > >
> > > > In the current proposal, it does not require PCI transactions, as
> > > > there is only a
> > > hypervisor-initiated query interface.
> > > > It is a trade off of using svq + pasid vs using something from the device.
> > > >
> > > > Again, both have different use cases and value. One uses the CPU and one
> > > > uses the device.
> > > > Depending on how much power one wants to spend where.
> > >
> > > Also how much effort we want to spend on this virtio specific thing.
> > > There needs to be a *reason* to do things in virtio as opposed to
> > > using platform capabilities, this is exactly the same thing I told
> > > Lingshan wrt using SUSPEND for power management as opposed to using
> > > PCI PM - relying on platform when we can is right there in the mission
> statement.
> > > For some reason I assumed you guys have done a PoC and that's the
> > > motivation but if it's a "just in case" feature then I'd suggest we
> > > focus on merging patches 1-5 first.
> > >
> > It is not a just-in-case feature.
> > We learnt that not all CPUs have it.
> 
> Have dirty tracking? Well shadow is portable.
>
We have seen that shadow is not helpful. It has its own very weird issues, which I won't bring up here.
 
> > There are ongoing efforts on the PoC.
> > We will have the results in some time.
> >
> > We have a similar interface on at least two devices already, integrated in the
> > Linux stack; one is upstream, the other is in progress.
> 
> Aha.  I hear IOMMUFD is working on integrating access to dirty bit.
> Maybe compare performance to that?
> It does not have to be exactly virtio I think for PoC.
> 
Yes, it is.
The point is, even if we compare, there is no comparison point for the CPUs that do not support it.
Users are not going to use a mediation layer anyway and orchestrate things differently just because the data center has a mix of servers.

It is far easier to run through the same set of hw + sw stack. This is the feedback we got from the users.
Hence the device expense.

After merging the series 1-5, we will have some early perf numbers as well.
The expense is not a lot in the current PoC round.
DPUs for dynamic workloads have 8MB of RAM.

> > virtio is also in discussion here.
> >
> > Sure, it is proposed as optional. We can focus on 1-5 first.
> > I will split the series once I have comments.
> >
> > There is also extension after 1-5 for net device context as well.
> >
> >
> > >
> > > > > > So I'd like Parav to publish some experiment results and/or
> > > > > > some estimates.
> > > > > >
> > > > >
> > > > > That's fine, but the above equation (used by Qemu) is sufficient
> > > > > to demonstrate how hard to stick wire speed in the case.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > In the current state, it is mandating.
> > > > > > > > > > > > And if you think PRI is the only way,
> > > > > > > > > > >
> > > > > > > > > > > I don't, it's just an example where virtio can
> > > > > > > > > > > leverage from either transport or platform. Or if
> > > > > > > > > > > it's the fault in virtio that slows down the PRI, then it is
> something we can do.
> > > > > > > > > > >
> > > > > > > > > > > >  than you should propose that in the dirty page
> > > > > > > > > > > > tracking series that
> > > > > you listed above to not do dirty page tracking. Rather depend on PRI,
> right?
> > > > > > > > > > >
> > > > > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > > > > >
> > > > > > > > > > If someone says they tried and platform's migration
> > > > > > > > > > support does not work for them and they want to build
> > > > > > > > > > a solution in virtio then what exactly is the objection?
> > > > > > > > >
> > > > > > > > > The discussion is to make sure whether virtio can do
> > > > > > > > > this easily and correctly, then we can have a
> > > > > > > > > conclusion. I've stated some issues above, and I've
> > > > > > > > > asked other questions related to them which are still not
> answered.
> > > > > > > > >
> > > > > > > > > I think we had a very hard time in bypassing IOMMU in
> > > > > > > > > the past that we don't want to repeat.
> > > > > > > > >
> > > > > > > > > We've gone through several methods of logging dirty
> > > > > > > > > pages in the past (each with pros/cons), but this
> > > > > > > > > proposal never explains why it chooses one of them but
> > > > > > > > > not others. Spec needs to find the best path instead of
> > > > > > > > > just a possible path without any rationale about
> > > > > why.
> > > > > > > >
> > > > > > > > Adding more rationale isn't a bad thing.
> > > > > > > > In particular if platform supplies dirty tracking then how
> > > > > > > > does driver decide which to use platform or device capability?
> > > > > > > > A bit of discussion around this is a good idea.
> > > > > > > >
> > > > > > > >
> > > > > > > > > > virtio is here in the
> > > > > > > > > > first place because emulating devices didn't work well.
> > > > > > > > >
> > > > > > > > > I don't understand here. We have supported emulated
> > > > > > > > > devices for
> > > years.
> > > > > > > > > I'm pretty sure a lot of issues could be uncovered if
> > > > > > > > > this proposal can be prototyped with an emulated device first.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > >
> > > > > > > > virtio was originally PV as opposed to emulation. That
> > > > > > > > there's now hardware virtio and you call software
> > > > > > > > implementation "an emulation" is very meta.
> > > > > > >
> > > > > > > Yes but I don't see how it relates to dirty page tracking.
> > > > > > > When we find a way it should work for both software and hardware
> devices.
> > > > > > >
> > > > > > > Thanks
> > > > > >
> > > > > > It has to work well on a variety of existing platforms. If it
> > > > > > does then sure, why would we roll our own.
> > > > >
> > > > > If virtio can do that in an efficient way without any issues, I agree.
> > > > > But it seems not.
> > > > >
> > > > > Thanks
> > >
> > >
> >


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 10:52                                                               ` Parav Pandit
@ 2023-11-17 11:32                                                                 ` Michael S. Tsirkin
  2023-11-17 12:22                                                                   ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 11:32 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 10:52:49AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 4:08 PM
> > 
> > On Fri, Nov 17, 2023 at 09:57:52AM +0000, Parav Pandit wrote:
> > >
> > > > From: virtio-comment@lists.oasis-open.org
> > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Michael S.
> > > > Tsirkin
> > > > Sent: Friday, November 17, 2023 3:21 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 09:41:40AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 3:08 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 09:14:21AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 2:16 PM In any case you can
> > > > > > > > safely assume that many users will have migration that takes
> > > > > > > > seconds and minutes.
> > > > > > >
> > > > > > > Strange, but ok. I don't see any problem with current method.
> > > > > > > 8MB is used for a very large VM of 1TB whose migration takes minutes. Should be fine.
> > > > > >
> > > > > > The problem is simple: vendors selling devices have no idea how
> > > > > > large the VM will be. So you have to over-provision for the max VM size.
> > > > > > If there was a way to instead allocate that in host memory, that
> > > > > > would improve on this.
> > > > >
> > > > > Not sure what to over provision for max VM size.
> > > > > Vendor does not know how many vcpus will be needed. It is no
> > > > > different
> > > > problem.
> > > > >
> > > > > When the VM migration is started, the individual tracking range is
> > > > > supplied by
> > > > the hypervisor to device.
> > > > > Device allocates necessary memory on this instruction.
> > > > >
> > > > > When the VM with certain size is provisioned, the member device
> > > > > can be
> > > > provisioned for the VM size.
> > > > > And if it cannot be provisioned, possibly this may not the right
> > > > > member device
> > > > to use at that point in time.
> > > >
> > > > For someone who keeps arguing against adding single bit registers
> > > > "because it does not scale" you seem very nonchalant about adding
> > 8Mbytes.
> > > >
> > > There is fundamental difference on how/when a bit is used.
> > > One wants to use a bit for non-performance part and keep it always available
> > vs data path.
> > > Not same comparison.
> > >
> > > > I thought we have a nicely contained and orthogonal feature, so if
> > > > it's optional it's not a problem.
> > > It is optional as always.
> > >
> > > >
> > > > But with such costs and corner cases what exactly is the motivation
> > > > for the feature here?
> > > New generations DPUs have memory for device data path workloads but not
> > for bits.
> > >
> > > > Do you have a PoC showing how this works better than e.g.
> > > > shadow VQ?
> > > >
> > > Not yet.
> > > But I don't think this can even be a criterion to consider, as the dependency on
> > > PASID is a nonstarter, along with other limitations.
> > 
> > You just need the dirty bit in the PTE; whether that is tied to PASID depends very much
> > on the platform.  For VT-d I think it is.  And if shadow vq works as a fallback, it
> > just might be reasonable not to do any tracking in virtio.
> >
> Somehow, the claim that shadow vq is great, made without sharing any performance numbers, is what I don't agree with.

It's upstream in QEMU. Test it yourself.

> And it fundamentally does not fit the generic stack where virtio is to be used.
> 
> We have accelerated some of the shadow vq work for non-virtio devices, and those optimizations are not elegant; I wouldn't want to bring them to the virtio spec.
> A different discussion.

Let's just say, it's more elegant than what I have seen so far.

> > > > Maybe IOMMU based and shadow VQ based tracking are the way to go
> > > > initially, and if there's a problem then we should add this later, on top.
> > > >
> > > For the cpus that does not support IOMMU cannot shift to shadow VQ either.
> > 
> > I don't know what this means (no IOMMU at all?) but it looks like shadow vq
> > and similar approaches are in production with vdpa and have been
> > demonstrated for a while. All we are doing is supporting them in virtio proper.
> > 
> IOMMU is present but does not have support for D bit.

Yes, there are systems like this.  It would be interesting to see some
info on how widespread this is.  Sometimes it is easier to just tell
customers "so buy a better IOMMU" instead of investing in workarounds.

> > > > I really want us to finally make progress merging features and
> > > > anything that reduces scope initially is good for that.
> > > >
> > > Yes, if you prefer to split the last three patches, I am fine.
> > > Please let me know.
> > 
> > As there have not been any comments on 1-5, I don't think there's a need to repost
> > this just yet. I'll review 1-5 next week.
> > I think in the next version it might be wise to split this and post as two series,
> > yes.
> Ok.




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 11:05                                           ` Parav Pandit
@ 2023-11-17 11:33                                             ` Michael S. Tsirkin
  2023-11-17 11:45                                               ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 11:33 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 4:30 PM
> > 
> > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > Sent: Friday, November 17, 2023 3:30 PM
> > > >
> > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > > >>
> > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > >>>> We should expose a limit of the device in the proposed
> > > > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > > >>>> So that future provisioning framework can use it.
> > > > >>>>
> > > > >>>> I will cover this in v5 early next week.
> > > > >>> I do worry about how this can even work though. If you want a
> > > > >>> generic device you do not get to dictate how much memory VM has.
> > > > >>>
> > > > >>> Aren't we talking bit per page? With 1TByte of memory to track
> > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > >>>
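
For reference, a small back-of-envelope helper for the bit-per-page cost
(assuming 4 KiB pages; the actual cost depends on the tracking granularity a
device chooses):

#include <stdio.h>

int main(void)
{
	const unsigned long long guest_mem = 1ULL << 40;   /* 1 TiB of guest memory */
	const unsigned long long page_size = 4096;         /* assumed 4 KiB granularity */
	unsigned long long pages = guest_mem / page_size;
	unsigned long long bitmap_bytes = pages / 8;        /* one bit per page */

	printf("%llu pages -> %llu MiB of bitmap\n", pages, bitmap_bytes >> 20);
	return 0;
}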
> > > > >>> And you happily say "we'll address this in the future" while at
> > > > >>> the same time fighting tooth and nail against adding single bit
> > > > >>> status registers because scalability?
> > > > >>>
> > > > >>>
> > > > >>> I have a feeling doing this completely theoretical like this is problematic.
> > > > >>> Maybe you have it all laid out neatly in your head but I suspect
> > > > >>> not all of TC can picture it clearly enough based just on spec text.
> > > > >>>
> > > > >>> We do sometimes ask for POC implementation in linux / qemu to
> > > > >>> demonstrate how things work before merging code. We skipped this
> > > > >>> for admin things so far but I think it's a good idea to start
> > > > >>> doing it here.
> > > > >>>
> > > > >>> What makes me pause a bit before saying please do a PoC is all
> > > > >>> the opposition that seems to exist to even using admin commands
> > > > >>> in the 1st place. I think once we finally stop arguing about
> > > > >>> whether to use admin commands at all then a PoC will be needed
> > before merging.
> > > > >> We have POR productions that implemented the approach in my series.
> > > > >> They are multiple generations of productions in market and
> > > > >> running in customers data centers for years.
> > > > >>
> > > > >> Back to 2019 when we start working on vDPA, we have sent some
> > > > >> samples of production(e.g., Cascade Glacier) and the datasheet,
> > > > >> you can find live migration facilities there, includes suspend,
> > > > >> vq state and other features.
> > > > >>
> > > > >> And there is an reference in DPDK live migration, I have provided
> > > > >> this page
> > > > >> before:
> > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it has been
> > > > >> working for long long time.
> > > > >>
> > > > >> So if we let the facts speak, if we want to see if the proposal
> > > > >> is proven to work, I would
> > > > >> say: They are POR for years, customers already deployed them for years.
> > > > > And I guess what you are trying to say is that this patchset we
> > > > > are reviewing here should be help to the same standard and there
> > > > > should be a PoC? Sounds reasonable.
> > > > > Yes, and the in-market products are POR; the series just
> > > > > improves the design. For example, our series also uses registers to
> > > > > track vq state, but with improvements over CG or BSC. So I think they are proven
> > > > > to work.
> > >
> > > If you prefer to go the route of POR and production and proven documents
> > > etc., there is a ton of it for multiple types of products that I can dump here,
> > > with open-source code, documentation and more.
> > > Let me know what you would like to see.
> > >
> > > Michael has requested some performance comparisons; not all are ready to
> > > share yet.
> > > Some are available, which I will share in the coming weeks.
> > >
> > > And all the vdpa dpdk you published did not have basic CVQ support when I
> > > last looked at it.
> > > Do you know when it was added?
> > 
> > It's good enough for a PoC I think, CVQ or not.
> > The problem with CVQ generally is that vDPA wants to shadow CVQ at all
> > times because it wants to decode and cache the content. But this problem has
> > nothing to do with dirty tracking even though it also mentions "shadow":
> > if the device can report its state then there's no need to shadow CVQ.
> 
> For the performance numbers with the pre-copy and device context of patches posted 1 to 5, the downtime reduction of the VM is 3.71x with active traffic on 8 RQs at 100Gbps port speed.

Sounds good. Can you please post a bit more detail?
Which configs are you comparing, and what was the result on each of them?

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 11:33                                             ` Michael S. Tsirkin
@ 2023-11-17 11:45                                               ` Parav Pandit
  2023-11-17 12:04                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 11:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 5:04 PM
> 
> On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 4:30 PM
> > >
> > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > >
> > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > > > >>
> > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > >>>> We should expose a limit of the device in the proposed
> > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> track.
> > > > > >>>> So that future provisioning framework can use it.
> > > > > >>>>
> > > > > >>>> I will cover this in v5 early next week.
> > > > > >>> I do worry about how this can even work though. If you want
> > > > > >>> a generic device you do not get to dictate how much memory VM
> has.
> > > > > >>>
> > > > > >>> Aren't we talking bit per page? With 1TByte of memory to
> > > > > >>> track
> > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > >>>
> > > > > >>> And you happily say "we'll address this in the future" while
> > > > > >>> at the same time fighting tooth and nail against adding
> > > > > >>> single bit status registers because scalability?
> > > > > >>>
> > > > > >>>
> > > > > >>> I have a feeling doing this completely theoretical like this is
> problematic.
> > > > > >>> Maybe you have it all laid out neatly in your head but I
> > > > > >>> suspect not all of TC can picture it clearly enough based just on spec
> text.
> > > > > >>>
> > > > > >>> We do sometimes ask for POC implementation in linux / qemu
> > > > > >>> to demonstrate how things work before merging code. We
> > > > > >>> skipped this for admin things so far but I think it's a good
> > > > > >>> idea to start doing it here.
> > > > > >>>
> > > > > >>> What makes me pause a bit before saying please do a PoC is
> > > > > >>> all the opposition that seems to exist to even using admin
> > > > > >>> commands in the 1st place. I think once we finally stop
> > > > > >>> arguing about whether to use admin commands at all then a
> > > > > >>> PoC will be needed
> > > before merging.
> > > > > >> We have POR productions that implemented the approach in my
> series.
> > > > > >> They are multiple generations of productions in market and
> > > > > >> running in customers data centers for years.
> > > > > >>
> > > > > >> Back to 2019 when we start working on vDPA, we have sent some
> > > > > >> samples of production(e.g., Cascade Glacier) and the
> > > > > >> datasheet, you can find live migration facilities there,
> > > > > >> includes suspend, vq state and other features.
> > > > > >>
> > > > > >> And there is an reference in DPDK live migration, I have
> > > > > >> provided this page
> > > > > >> before:
> > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it has
> > > > > >> been working for long long time.
> > > > > >>
> > > > > >> So if we let the facts speak, if we want to see if the
> > > > > >> proposal is proven to work, I would
> > > > > >> say: They are POR for years, customers already deployed them for
> years.
> > > > > > And I guess what you are trying to say is that this patchset
> > > > > > we are reviewing here should be help to the same standard and
> > > > > > there should be a PoC? Sounds reasonable.
> > > > > Yes and the in-marketing productions are POR, the series just
> > > > > improves the design, for example, our series also use registers
> > > > > to track vq state, but improvements than CG or BSC. So I think
> > > > > they are proven
> > > to work.
> > > >
> > > > If you prefer to go the route of POR and production and proven
> > > > documents
> > > etc, there is ton of it of multiple types of products I can dump
> > > here with open- source code and documentation and more.
> > > > Let me know what you would like to see.
> > > >
> > > > Michael has requested some performance comparisons, not all are
> > > > ready to
> > > share yet.
> > > > Some are present that I will share in coming weeks.
> > > >
> > > > And all the vdpa dpdk you published does not have basic CVQ
> > > > support when I
> > > last looked at it.
> > > > Do you know when was it added?
> > >
> > > It's good enough for PoC I think, CVQ or not.
> > > The problem with CVQ generally is that VDPA wants to shadow it
> > > at all times because it wants to decode and cache the content. But
> > > this problem has nothing to do with dirty tracking even though it also
> mentions "shadow":
> > > if the device can report its state then there's no need to shadow CVQ.
> >
> > For the performance numbers with the pre-copy and device context of
> patches posted 1 to 5, the downtime reduction of the VM is 3.71x with active
> traffic on 8 RQs at 100Gbps port speed.
> 
> Sounds good can you please post a bit more detail?
> which configs are you comparing what was the result on each of them.

Common config: 8+8 tx and rx queues.
Port speed: 100Gbps
QEMU 8.1
Libvirt 7.0
GVM: Centos 7.4
Device: virtio VF hardware device

Config_1: virtio suspend/resume similar to what Lingshan has, largely vdpa stack
Config_2: Device context method of admin commands



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 11:32                                         ` Parav Pandit
@ 2023-11-17 11:49                                           ` Michael S. Tsirkin
  2023-11-17 12:15                                             ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 11:49 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 11:32:35AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 4:50 PM
> > 
> > On Fri, Nov 17, 2023 at 10:48:36AM +0000, Parav Pandit wrote:
> > > > > > Actually not, Parav said the device needs to reserve sufficient
> > > > > > resources in another thread.
> > > > > The device resource reservation starts only when the device migration
> > starts.
> > > > > i.e. with WRITE_RECORDS_START command of patch 7 in the series.
> > > >
> > > > And now your precious VM can't migrate at all because -ENOSPC.
> > > >
> > > I am not aware of any Linux IOCTL that ensures a guaranteed execution
> > > without an error code. :)
> > >
> > > As we talked about in the other email, a VF can be provisioned too as an extension and
> > the capability can be exposed.
> > > This is not going to be the only error on device migration.
> > 
> > Allocating resources on outgoing migration is a very bad idea.
> > > > It is common to migrate precisely because you are out of resources.
> > Incoming is a different story, less of a problem.
> >
> The resource allocated may not be on the same system.
> Also, the resource is allocated while the VM is running, so I don't see a problem.

It's not that you can't see it, it's that you don't care. I really wish
> more people would try and see how the spec has to address use-cases outside
their own narrow field but I guess most people just see it as not their
job, nvidia pays you to care about nvidia things and the rest is not
your problem. Oh well.

> Additionally, this is not what the Linux kernel maintainers of the iommu subsystem told us either.
> Let me know if you check with Alex W and Jason, who built this interface.

The VFIO guys have their own ideas; if they want to talk to the virtio guys they
can come here and do that.


> > 
> > > >
> > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > The data structure is different but I don't see why it is critical.
> > > > > > > > >
> > > > > > > > > I agree that I don't see out of buffers notifications too
> > > > > > > > > which implies device has to maintain something like a bitmap
> > internally.
> > > > > > > > > Which I guess could be fine but it is not clear to me how
> > > > > > > > > large that bitmap has to be. How does the device know?
> > > > > > > > > Needs to be
> > > > addressed.
> > > > > > > >
> > > > > > > > This is the question I asked Parav in another thread. Using
> > > > > > > > host memory as a queue with notification (like PML) might be much
> > better.
> > > > > > >
> > > > > > > Well if queue is what you want to do you can just do it internally.
> > > > > >
> > > > > > Then it's not the proposal here, Parav has explained it in
> > > > > > another reply, and as explained it lacks a lot of other facilities.
> > > > > >
> > > > > PML is yet another option that requires small pci writes.
> > > > > In the current proposal, there are no small PCI writes.
> > > > > It is a query interface from the device.
> > > > >
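For illustration only, a hypothetical shape of such a pull-style query; the structures below are invented for this sketch and are not the command layout proposed in the series. The idea is that the hypervisor asks about a range and the device returns a read-and-clear bitmap, so the device never has to initiate small PCI writes:

/* Hypothetical layout for discussion only; not the admin command in the series. */
#include <stdint.h>

struct dirty_query_req {                /* driver -> device */
        uint64_t range_start;           /* start of the IOVA/GPA range to query */
        uint64_t range_length;          /* length of the range in bytes */
        uint64_t page_size;             /* tracking granularity, e.g. 4096 */
};

struct dirty_query_resp {               /* device -> driver */
        uint64_t bitmap_length;         /* number of bitmap bytes that follow */
        uint8_t  bitmap[];              /* one bit per page, read-and-clear */
};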
> > > > > > > Problem of course is that it might overflow and cause things
> > > > > > > like packet drops.
> > > > > >
> > > > > > Exactly like PML. So sticking to wire speed should not be a
> > > > > > general goal in the context of migration. It can be done if the
> > > > > > speed of the migration interface is faster than the virtio device that
> > needs to be migrated.
> > > > > May not have to be.
> > > > > Speed of page recording should be fast enough.
> > > > > It usually improves with subsequent generation.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Even if we manage to do that, it doesn't mean we won't have
> > issues.
> > > > > > > > > >
> > > > > > > > > > 1) For many reasons it can neither see nor log via GPA,
> > > > > > > > > > so this requires a traversal of the vIOMMU mapping
> > > > > > > > > > tables by the hypervisor afterwards, it would be
> > > > > > > > > > expensive and need synchronization with the guest
> > > > > > > > > > modification of the IO page table which
> > > > > > looks very hard.
> > > > > > > > >
> > > > > > > > > vIOMMU is fast enough to be used on data path but not fast
> > > > > > > > > enough for dirty tracking?
> > > > > > > >
> > > > > > > > We set up SPTEs or using nesting offloading where the PTEs
> > > > > > > > could be iterated by hardware directly which is fast.
> > > > > > >
> > > > > > > There's a way to have hardware find dirty PTEs for you quickly?
> > > > > >
> > > > > > Scanning PTEs on the host is faster and more secure than
> > > > > > scanning guests, that's what I want to say:
> > > > > >
> > > > > > 1) the guest page could be swapped out but not the host one.
> > > > > > 2) no guest triggerable behavior
> > > > > >
> > > > >
> > > > > Device page tracking table to be consulted to flush on mapping change.
> > > > >
> > > > > > > I don't know how it's done. Do tell.
> > > > > > >
> > > > > > >
> > > > > > > > This is not the case here where software needs to iterate
> > > > > > > > the IO page tables in the guest which could be slow.
> > > > > > > >
> > > > > > > > > Hard to believe.  If true and you want to speed up vIOMMU
> > > > > > > > > then you implement an efficient datastructure for that.
> > > > > > > >
> > > > > > > > Besides the issue of performance, it's also racy, assuming
> > > > > > > > we are logging
> > > > > > IOVA.
> > > > > > > >
> > > > > > > > 0) device log IOVA
> > > > > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > > > > 2) guest map IOVA to a new GPA
> > > > > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > > > > >
> > > > > > > > Then we lost the old GPA.
> > > > > > >
> > > > > > > Interesting and a good point.
> > > > > >
> > > > > > Note that PML logs at GPA as it works at L1 of EPT.
> > > > > >
> > > > > > > And by the way e.g. vhost has the same issue.  You need to
> > > > > > > flush dirty tracking info when changing the mappings somehow.
> > > > > >
> > > > > > It's not,
> > > > > >
> > > > > > 1) memory translation is done by vhost
> > > > > > 2) vhost knows GPA and it doesn't log via IOVA.
> > > > > >
> > > > > > See this for example, and DPDK has similar fixes.
> > > > > >
> > > > > > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > > > > > Author: Jason Wang <jasowang@redhat.com>
> > > > > > Date:   Wed Jan 16 16:54:42 2019 +0800
> > > > > >
> > > > > >     vhost: log dirty page correctly
> > > > > >
> > > > > >     Vhost dirty page logging API is designed to sync through GPA. But we
> > > > > >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > > > > >     lead to missing data after migration.
> > > > > >
> > > > > >     To solve this issue, when logging with device IOTLB enabled, we will:
> > > > > >
> > > > > >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > > > > >        get HVA, for writable descriptor, get HVA through iovec. For used
> > > > > >        ring update, translate its GIOVA to HVA
> > > > > >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> > > > > >        through GPA. Pay attention this reverse mapping is not guaranteed
> > > > > >        to be unique, so we should log each possible GPA in this case.
> > > > > >
> > > > > >     This fix the failure of scp to guest during migration. In -next, we
> > > > > >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > > > > >
> > > > > >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > > > > >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > > > > >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> > > > > >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > > >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > > > >
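For context, the fix quoted above boils down to a reverse lookup from HVA back to every possible GPA before logging; a simplified sketch of the idea (not the actual kernel code):

/* Simplified sketch of the idea in the commit above; not the kernel implementation. */
#include <stdint.h>

struct mem_region { uint64_t gpa, hva, len; };

static void log_write_hva(const struct mem_region *regions, int nregions,
                          uint64_t hva, uint64_t len,
                          void (*log_gpa)(uint64_t gpa, uint64_t len))
{
        /* The GPA->HVA map is not guaranteed to be unique, so every GPA that
         * aliases this HVA range has to be logged. */
        for (int i = 0; i < nregions; i++) {
                const struct mem_region *r = &regions[i];
                if (hva >= r->hva && hva + len <= r->hva + r->len)
                        log_gpa(r->gpa + (hva - r->hva), len);
        }
}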
> > > > > > All of the above is not what virtio did right now.
> > > > > >
> > > > > > > Parav what's the plan for this? Should be addressed in the spec too.
> > > > > > >
> > > > > >
> > > > > > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> > > > > >
> > > > >
> > > > > The query interface in this proposal works on the granular
> > > > > boundary to read
> > > > and clear.
> > > > > This will ensure that mapping is consistent.
> > > >
> > > > By itself it does not, you have to actually keep querying until you
> > > > flush all dirty info and do it each time there's an invalidation in the IOMMU.
> > > >
> > > Only during device migration time.
> > > It only applies in those specific cases when unmapping and migration are both in
> > progress at the same time.
> > > But yes, it can slow down unmapping.
> > >
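Put differently, the ordering constraint during the migration window would look roughly like the sketch below (illustrative only; every function name is a placeholder, not one of the proposed commands):

/* Illustrative ordering only; every function here is a placeholder. */
#include <stdint.h>
#include <stdbool.h>

static bool migration_in_progress(void) { return true; }
static void query_and_clear_device_write_records(uint64_t iova, uint64_t len) { }
static void mark_gpas_dirty_via_old_mapping(uint64_t iova, uint64_t len) { }
static void apply_viommu_unmap(uint64_t iova, uint64_t len) { }

static void viommu_unmap(uint64_t iova, uint64_t len)
{
        if (migration_in_progress()) {
                /* Pull and clear the device's write records for this range
                 * while the old IOVA->GPA mapping is still valid... */
                query_and_clear_device_write_records(iova, len);
                /* ...and mark the corresponding GPAs dirty using that old mapping. */
                mark_gpas_dirty_via_old_mapping(iova, len);
        }
        /* Only then apply the unmap in the vIOMMU. */
        apply_viommu_unmap(iova, len);
}

This is also why the unmap path can get slower while migration is in progress, as noted above.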
> > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > 2) There are a lot of special or reserved IOVA ranges
> > > > > > > > > > (for example the interrupt areas in x86) that need
> > > > > > > > > > special care which is architectural and where it is
> > > > > > > > > > beyond the scope or knowledge of the virtio device but the
> > platform IOMMU.
> > > > > > > > > > Things would be more complicated when SVA is enabled.
> > > > > > > > >
> > > > > > > > > SVA being what here?
> > > > > > > >
> > > > > > > > For example, IOMMU may treat interrupt ranges differently
> > > > > > > > depending on whether SVA is enabled or not. It's very hard
> > > > > > > > and unnecessary to teach devices about this.
> > > > > > >
> > > > > > > Oh, shared virtual memory. So what you are saying here? virtio
> > > > > > > does not care, it just uses some addresses and if you want it
> > > > > > > to it can record writes somewhere.
> > > > > >
> > > > > > One example, PCI allows devices to send translated requests, how
> > > > > > can a hypervisor know it's a PA or IOVA in this case? We
> > > > > > probably need a new bit. But it's not the only thing we need to deal with.
> > > > > >
> > > > > > By definition, interrupt ranges and other reserved ranges should
> > > > > > not belong to dirty pages. And the logging should be done before
> > > > > > the DMA where there's no way for the device to know whether or
> > > > > > not an IOVA is valid or not. It would be more safe to just not
> > > > > > report them from the source instead of leaving it to the
> > > > > > hypervisor to deal with but this seems impossible at the device
> > > > > > level. Otherwise the hypervisor driver needs to communicate with
> > > > > > the (v)IOMMU to be reached with the
> > > > > > interrupt(MSI) area, RMRR area etc in order to do the correct
> > > > > > things or it might have security implications. And those areas
> > > > > > don't make sense at L1 when vSVA is enabled. What's more, when
> > > > > > vIOMMU could be fully offloaded, there's no easy way to fetch that
> > information.
> > > > > >
> > > > > There cannot be logging before the DMA.
> > > > > Only requirement is before the mapping changes, the dirty page
> > > > > tracking to be
> > > > synced.
> > > > >
> > > > > In most common cases where the perf is critical, such mapping wont
> > > > > change
> > > > so often dynamically anyway.
> > > > >
> > > > > > Again, it's hard to bypass or even duplicate the functionality
> > > > > > of the platform or we need to step into every single detail of a
> > > > > > specific transport, architecture or IOMMU to figure out whether
> > > > > > or not logging at virtio is correct which is awkward and unrealistic.
> > > > > > This proposal suffers from an exact similar issue when inventing
> > > > > > things like freeze/stop where I've pointed out other branches of
> > > > > > issues as
> > > > well.
> > > > > >
> > > > > It is incorrect attribution that platform is duplicated here.
> > > > > It feeds the data to the platform as needed without replicating.
> > > > >
> > > > > I do agree that there is overlap of IOMMU tracking the dirty and
> > > > > storing it in
> > > > the per PTE vs device supplying its dirty track via its own interface.
> > > > > Both are consolidated at hypervisor level.
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > And there could be other architecte specific knowledge
> > > > > > > > > > (e.g
> > > > > > > > > > PAGE_SIZE) that might be needed. There's no easy way to
> > > > > > > > > > deal with those cases.
> > > > > > > > >
> > > > > > > > > Good point about page size actually - using 4k
> > > > > > > > > unconditionally is a waste of resources.
> > > > > > > >
> > > > > > > > Actually, they are more than just PAGE_SIZE, for example,
> > > > > > > > PASID and
> > > > others.
> > > > > > >
> > > > > > > what does pasid have to do with it? anyway, just give driver
> > > > > > > control over page size.
> > > > > >
> > > > > > For example, two virtqueues have two PASIDs assigned. How can a
> > > > > > hypervisor know which specific IOVA belongs to which IOVA? For
> > > > > > platform IOMMU, they are handy as it talks to the transport. But
> > > > > > I don't think we need to duplicate every transport specific
> > > > > > address space
> > > > feature in core virtio layer:
> > > > > >
> > > > > PASID to vq assignment won't be duplicated.
> > > > > It is configured fully by the guest without consulting hypervisor
> > > > > at the device
> > > > level.
> > > > > Guest IOMMU would consult hypervisor to setup any PASID mapping as
> > > > > part
> > > > of any mapping method.
> > > > >
> > > > > > 1) translated/untranslated request
> > > > > > 2) request w/ and w/o PASID
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > We wouldn't need to care about all of them if it is done
> > > > > > > > > > at platform IOMMU level.
> > > > > > > > >
> > > > > > > > > If someone logs at IOMMU level then nothing needs to be
> > > > > > > > > done in the spec at all. This is about capability at the device level.
> > > > > > > >
> > > > > > > > True, but my question is where or not it can be done at the
> > > > > > > > device level
> > > > > > easily.
> > > > > > >
> > > > > > > there's no "easily" about live migration ever.
> > > > > >
> > > > > > I think I've stated sufficient issues to demonstrate how hard
> > > > > > virtio wants to
> > > > do it.
> > > > > > And I've given the link that it is possible to do that in IOMMU
> > > > > > without those issues. So in this context doing it in virtio is much harder.
> > > > > >
> > > > > > > For example on-device iommus are a thing.
> > > > > >
> > > > > > I'm not sure that's the way to go considering the platform IOMMU
> > > > > > evolves very quickly.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > what Lingshan
> > > > > > > > > > > proposed is analogous to bit per page - problem
> > > > > > > > > > > unfortunately is you can't easily set a bit by DMA.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm not saying bit/bytemap is the best, but it has been
> > > > > > > > > > used by real hardware. And we have many other options.
> > > > > > > > > >
> > > > > > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > > > > Because users needs to use it now.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If not, we are better off to offer this, and
> > > > > > > > > > > > > > > when/if platform support is, sure,
> > > > > > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4) if the platform support is missing, we
> > > > > > > > > > > > > > > > can use software or leverage transport for
> > > > > > > > > > > > > > > > assistance like PRI
> > > > > > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > > > > > Our experiment shows PRI performance is 21x
> > > > > > > > > > > > > > > slower than page fault rate
> > > > > > > > > > > > > > done by the cpu.
> > > > > > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If you stick to the wire speed during migration,
> > > > > > > > > > > > > > it can
> > > > converge.
> > > > > > > > > > > > > Do you have perf data for this?
> > > > > > > > > > > >
> > > > > > > > > > > > No, but it's not hard to imagine the worst case.
> > > > > > > > > > > > Wrote a small program that dirty every page by a NIC.
> > > > > > > > > > > >
> > > > > > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > > > > > >
> > > > > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > > > > >
> > > > > > > > > > > > So if we get very high dirty rates (e.g by a high
> > > > > > > > > > > > speed NIC), we can't satisfy the requirement of the downtime.
> > > > > > > > > > > > Or if you see the converge, you might get help from
> > > > > > > > > > > > the auto converge support by the hypervisors like
> > > > > > > > > > > > KVM where it tries to throttle the VCPU then you can't reach
> > the wire speed.
> > > > > > > > > > >
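To put rough numbers on that equation (purely illustrative; the dirty rate here is made up):

#include <stdio.h>

int main(void)
{
        /* Illustrative numbers only. */
        double page_size = 4096.0;              /* bytes per page */
        double dirty_rate = 2.0e6;              /* pages/s dirtied by a fast NIC (assumed) */
        double migration_bw = 100e9 / 8.0;      /* ~100 Gbps link, in bytes/s */

        double dirty_bw = dirty_rate * page_size;       /* bytes/s being re-dirtied */

        /* If dirty_bw approaches or exceeds migration_bw, pre-copy cannot converge
         * and whatever remains dirty has to be moved during the downtime window. */
        printf("dirty bw = %.1f Gbps, migration bw = %.1f Gbps, ratio = %.2f\n",
               dirty_bw * 8 / 1e9, migration_bw * 8 / 1e9, dirty_bw / migration_bw);
        return 0;
}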
> > > > > > > > > > > Will only work for some device types.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Yes, that's the point. Parav said he doesn't see the
> > > > > > > > > > issue, it's probably because he is testing a virtio-net
> > > > > > > > > > and so the vCPU is automatically throttled. It doesn't
> > > > > > > > > > mean it can work for other virito devices.
> > > > > > > > >
> > > > > > > > > Only for TX, and I'm pretty sure they had the foresight to
> > > > > > > > > test RX not just TX but let's confirm. Parav did you test both
> > directions?
> > > > > > > >
> > > > > > > > RX speed somehow depends on the speed of refill, so
> > > > > > > > throttling helps more or less.
> > > > > > >
> > > > > > > It doesn't depend on speed of refill you just underrun and
> > > > > > > drop packets. then your nice 10usec latency becomes more like 10sec.
> > > > > >
> > > > > > I miss your point here. If the driver can't achieve wire speed
> > > > > > without dirty page tracking, it can neither when dirty page
> > > > > > tracking is
> > > > enabled.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > > > > So it is unusable.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It's not about mandating, it's about doing
> > > > > > > > > > > > > > things in the correct layer. If PRI is slow, PCI can evolve for
> > sure.
> > > > > > > > > > > > > You should try.
> > > > > > > > > > > >
> > > > > > > > > > > > Not my duty, I just want to make sure things are
> > > > > > > > > > > > done in the correct layer, and once it needs to be
> > > > > > > > > > > > done in the virtio, there's nothing obviously wrong.
> > > > > > > > > > >
> > > > > > > > > > > Yea but just vague questions don't help to make sure eiter way.
> > > > > > > > > >
> > > > > > > > > > I don't think it's vague, I have explained, if something
> > > > > > > > > > in the virito slows down the PRI, we can try to fix them.
> > > > > > > > >
> > > > > > > > > I don't believe you are going to make PRI fast. No one managed so
> > far.
> > > > > > > >
> > > > > > > > So it's the fault of PRI not virtio, but it doesn't mean we
> > > > > > > > need to do it in virtio.
> > > > > > >
> > > > > > > I keep saying with this approach we would just say "e1000
> > > > > > > emulation is slow and encumbered this is the fault of e1000"
> > > > > > > and never get virtio at all.  Assigning blame only gets you so far.
> > > > > >
> > > > > > I think we are discussing different things. My point is virtio
> > > > > > needs to leverage the functionality provided by transport or
> > > > > > platform (especially considering they evolve faster than
> > > > > > virtio). It seems to me it's hard even to duplicate some basic
> > > > > > function of platform IOMMU in
> > > > virtio.
> > > > > >
> > > > > Not duplicated. Feeding into the platform.
> > > >
> > > > I mean IOMMU still sets the dirty bit, too. How is that not a duplication?
> > > >
> > > Only if the IOMMU is enabled for it.
> > > For example, AMD has the DTE HAD bit to enable dirty page tracking in the IOMMU.
> > >
> > > So if the platform does not enable it, it can be enabled on the device, and vice versa.
> > 
> > So again, if your motivation is on-device IOMMU then say so, and in this case I
> > don't see the point of only adding write tracking without adding the actual
> > device IOMMU interface.
> It is not the device IOMMU, as it does not do all the work of the platform.
> 
> > And maybe that is the answer to resource management questions:
> > there's going to be an IOMMU data structure on the device and it's just an extra
> > bit in the PTE there.
> > Makes sense but let's see it all together then.
> > Because separate from on-device IOMMU it looks crazily expensive and just
> > weird.
> >
> I would agree that there is an expense there but worth for those cpus which cannot track it.

Can't parse this.

> > 
> > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > Missing functions in
> > > > > > > > > > platform or transport is not a good excuse to try to
> > > > > > > > > > workaround it in the virtio. It's a layer violation and
> > > > > > > > > > we never had any feature like this in the past.
> > > > > > > > >
> > > > > > > > > Yes missing functionality in the platform is exactly why
> > > > > > > > > virtio was born in the first place.
> > > > > > > >
> > > > > > > > Well the platform can't do device specific logic. But that's
> > > > > > > > not the case of dirty page tracking which is device logic agnostic.
> > > > > > >
> > > > > > > Not true platforms have things like NICs on board and have for
> > > > > > > many years. It's about performance really.
> > > > > >
> > > > > > I've stated sufficient issues above. And one more obvious issue
> > > > > > for device initiated page logging is that it needs a lot of
> > > > > > extra or unnecessary PCI transactions which will throttle the
> > > > > > performance of the whole system (and lead to other issues like
> > > > > > QOS). So I can't believe it has
> > > > good performance overall.
> > > > > > Logging via IOMMU or using shadow virtqueue doesn't need any
> > > > > > extra PCI transactions at least.
> > > > > >
> > > > > In the current proposal, it does not require PCI transactions, as
> > > > > there is only a
> > > > hypervisor-initiated query interface.
> > > > > It is a trade off of using svq + pasid vs using something from the device.
> > > > >
> > > > > Again, both has different use case and value. One uses cpu and one
> > > > > uses
> > > > device.
> > > > > Depending how much power one wants to spend where..
> > > >
> > > > Also how much effort we want to spend on this virtio specific thing.
> > > > There needs to be a *reason* to do things in virtio as opposed to
> > > > using platform capabilities, this is exactly the same thing I told
> > > > Lingshan wrt using SUSPEND for power management as opposed to using
> > > > PCI PM - relying on platform when we can is right there in the mission
> > statement.
> > > > For some reason I assumed you guys have done a PoC and that's the
> > > > motivation but if it's a "just in case" feature then I'd suggest we
> > > > focus on merging patches 1-5 first.
> > > >
> > > It is not a just-in-case feature.
> > > We learnt that not all CPUs have it.
> > 
> > Have dirty tracking? Well shadow is portable.
> >
> We have seen that shadow is not helpful. It has its own very weird issue; I won't bring it up here.
>  
> > > There are ongoing efforts on the PoC.
> > > We will have the results in some time.
> > >
> > > We have a similar interface on at least two devices already, integrated in the
> > Linux stack; one is upstream, the other is in progress.
> > 
> > Aha.  I hear IOMMUFD is working on integrating access to the dirty bit.
> > Maybe compare performance to that?
> > It does not have to be exactly virtio I think for PoC.
> > 
> Yes, it is.
> The point is, even if we compare, there is no comparison point for the CPUs that do not support it.
> Users are not going to use a mediation layer anyway and orchestrate things differently just because a data center has a mix of servers.

Just let the mediation layer thing be, please. You call whatever you have
"passthrough" and whatever you don't like "mediation layer". It helps you
sell hardware, more power to you, but it has nothing to do with the spec.


> It is far easier to run through the same set of hw + sw stack. This is the feedback we got from the users.
> Hence the device expense.
> 
> After merging patches 1-5 of the series, we will have some early perf numbers as well.
> The expense is not a lot in the current PoC round.
> DPUs have 8MB of RAM for the dynamic workload.

Total? So just one VM can migrate at a time? Wow. Talk about not
scaling.

> > > virtio is also in discussion here.
> > >
> > > Sure, it is proposed as optional. We can focus on 1-5 first.
> > > I will split the series once I have comments.
> > >
> > > There is also an extension after 1-5 for the net device context as well.
> > >
> > >
> > > >
> > > > > > > So I'd like Parav to publish some experiment results and/or
> > > > > > > some estimates.
> > > > > > >
> > > > > >
> > > > > > That's fine, but the above equation (used by Qemu) is sufficient
> > > > > > to demonstrate how hard to stick wire speed in the case.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > In the current state, it is mandating.
> > > > > > > > > > > > > And if you think PRI is the only way,
> > > > > > > > > > > >
> > > > > > > > > > > > I don't, it's just an example where virtio can
> > > > > > > > > > > > leverage from either transport or platform. Or if
> > > > > > > > > > > > it's the fault in virtio that slows down the PRI, then it is
> > something we can do.
> > > > > > > > > > > >
> > > > > > > > > > > > >  than you should propose that in the dirty page
> > > > > > > > > > > > > tracking series that
> > > > > > you listed above to not do dirty page tracking. Rather depend on PRI,
> > right?
> > > > > > > > > > > >
> > > > > > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > > > > > >
> > > > > > > > > > > If someone says they tried and platform's migration
> > > > > > > > > > > support does not work for them and they want to build
> > > > > > > > > > > a solution in virtio then what exactly is the objection?
> > > > > > > > > >
> > > > > > > > > > The discussion is to make sure whether virtio can do
> > > > > > > > > > this easily and correctly, then we can have a
> > > > > > > > > > conclusion. I've stated some issues above, and I've
> > > > > > > > > > asked other questions related to them which are still not
> > answered.
> > > > > > > > > >
> > > > > > > > > > I think we had a very hard time in bypassing IOMMU in
> > > > > > > > > > the past that we don't want to repeat.
> > > > > > > > > >
> > > > > > > > > > We've gone through several methods of logging dirty
> > > > > > > > > > pages in the past (each with pros/cons), but this
> > > > > > > > > > proposal never explains why it chooses one of them but
> > > > > > > > > > not others. Spec needs to find the best path instead of
> > > > > > > > > > just a possible path without any rationale about
> > > > > > why.
> > > > > > > > >
> > > > > > > > > Adding more rationale isn't a bad thing.
> > > > > > > > > In particular if platform supplies dirty tracking then how
> > > > > > > > > does driver decide which to use platform or device capability?
> > > > > > > > > A bit of discussion around this is a good idea.
> > > > > > > > >
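One possible shape of that decision, purely as a strawman (the capability checks below are invented for this sketch):

/* Strawman only; the capability checks are invented for this sketch. */
enum dirty_track_src { TRACK_NONE, TRACK_PLATFORM_IOMMU, TRACK_DEVICE };

static enum dirty_track_src pick_dirty_tracker(int iommu_has_dirty_tracking,
                                               int device_has_write_records)
{
        /* One policy: prefer the platform IOMMU (e.g. AMD's DTE HAD bit) when
         * present, and fall back to the device write-record capability otherwise. */
        if (iommu_has_dirty_tracking)
                return TRACK_PLATFORM_IOMMU;
        if (device_has_write_records)
                return TRACK_DEVICE;
        return TRACK_NONE;
}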
> > > > > > > > >
> > > > > > > > > > > virtio is here in the
> > > > > > > > > > > first place because emulating devices didn't work well.
> > > > > > > > > >
> > > > > > > > > > I don't understand here. We have supported emulated
> > > > > > > > > > devices for
> > > > years.
> > > > > > > > > > I'm pretty sure a lot of issues could be uncovered if
> > > > > > > > > > this proposal can be prototyped with an emulated device first.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > virtio was originally PV as opposed to emulation. That
> > > > > > > > > there's now hardware virtio and you call software
> > > > > > > > > implementation "an emulation" is very meta.
> > > > > > > >
> > > > > > > > Yes but I don't see how it relates to dirty page tracking.
> > > > > > > > When we find a way it should work for both software and hardware
> > devices.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > >
> > > > > > > It has to work well on a variety of existing platforms. If it
> > > > > > > does then sure, why would we roll our own.
> > > > > >
> > > > > > If virtio can do that in an efficient way without any issues, I agree.
> > > > > > But it seems not.
> > > > > >
> > > > > > Thanks
> > > >
> > > >
> > >
> 




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 11:45                                               ` Parav Pandit
@ 2023-11-17 12:04                                                 ` Michael S. Tsirkin
  2023-11-17 12:11                                                   ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 12:04 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 5:04 PM
> > 
> > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 4:30 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > >
> > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > > > > >>
> > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > >>>> We should expose a limit of the device in the proposed
> > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> > track.
> > > > > > >>>> So that future provisioning framework can use it.
> > > > > > >>>>
> > > > > > >>>> I will cover this in v5 early next week.
> > > > > > >>> I do worry about how this can even work though. If you want
> > > > > > >>> a generic device you do not get to dictate how much memory VM
> > has.
> > > > > > >>>
> > > > > > >>> Aren't we talking bit per page? With 1TByte of memory to
> > > > > > >>> track
> > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > >>>
> > > > > > >>> And you happily say "we'll address this in the future" while
> > > > > > >>> at the same time fighting tooth and nail against adding
> > > > > > >>> single bit status registers because scalability?
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> I have a feeling doing this completely theoretical like this is
> > problematic.
> > > > > > >>> Maybe you have it all laid out neatly in your head but I
> > > > > > >>> suspect not all of TC can picture it clearly enough based just on spec
> > text.
> > > > > > >>>
> > > > > > >>> We do sometimes ask for POC implementation in linux / qemu
> > > > > > >>> to demonstrate how things work before merging code. We
> > > > > > >>> skipped this for admin things so far but I think it's a good
> > > > > > >>> idea to start doing it here.
> > > > > > >>>
> > > > > > >>> What makes me pause a bit before saying please do a PoC is
> > > > > > >>> all the opposition that seems to exist to even using admin
> > > > > > >>> commands in the 1st place. I think once we finally stop
> > > > > > >>> arguing about whether to use admin commands at all then a
> > > > > > >>> PoC will be needed
> > > > before merging.
> > > > > > >> We have POR productions that implemented the approach in my
> > series.
> > > > > > >> They are multiple generations of productions in market and
> > > > > > >> running in customers data centers for years.
> > > > > > >>
> > > > > > >> Back to 2019 when we start working on vDPA, we have sent some
> > > > > > >> samples of production(e.g., Cascade Glacier) and the
> > > > > > >> datasheet, you can find live migration facilities there,
> > > > > > >> includes suspend, vq state and other features.
> > > > > > >>
> > > > > > >> And there is an reference in DPDK live migration, I have
> > > > > > >> provided this page
> > > > > > >> before:
> > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it has
> > > > > > >> been working for long long time.
> > > > > > >>
> > > > > > >> So if we let the facts speak, if we want to see if the
> > > > > > >> proposal is proven to work, I would
> > > > > > >> say: They are POR for years, customers already deployed them for
> > years.
> > > > > > > And I guess what you are trying to say is that this patchset
> > > > > > > we are reviewing here should be help to the same standard and
> > > > > > > there should be a PoC? Sounds reasonable.
> > > > > > Yes and the in-marketing productions are POR, the series just
> > > > > > improves the design, for example, our series also use registers
> > > > > > to track vq state, but improvements than CG or BSC. So I think
> > > > > > they are proven
> > > > to work.
> > > > >
> > > > > If you prefer to go the route of POR and production and proven
> > > > > documents
> > > > etc, there is ton of it of multiple types of products I can dump
> > > > here with open- source code and documentation and more.
> > > > > Let me know what you would like to see.
> > > > >
> > > > > Michael has requested some performance comparisons, not all are
> > > > > ready to
> > > > share yet.
> > > > > Some are present that I will share in coming weeks.
> > > > >
> > > > > And all the vdpa dpdk you published does not have basic CVQ
> > > > > support when I
> > > > last looked at it.
> > > > > Do you know when was it added?
> > > >
> > > > It's good enough for PoC I think, CVQ or not.
> > > > The problem with CVQ generally, is that VDPA wants to shadow CVQ it
> > > > at all times because it wants to decode and cache the content. But
> > > > this problem has nothing to do with dirty tracking even though it also
> > mentions "shadow":
> > > > if device can report it's state then there's no need to shadow CVQ.
> > >
> > > For the performance numbers with the pre-copy and device context of
> > patches posted 1 to 5, the downtime reduction of the VM is 3.71x with active
> > traffic on 8 RQs at 100Gbps port speed.
> > 
> > Sounds good can you please post a bit more detail?
> > which configs are you comparing what was the result on each of them.
> 
> Common config: 8+8 tx and rx queues.
> Port speed: 100Gbps
> QEMU 8.1
> Libvirt 7.0
> GVM: Centos 7.4
> Device: virtio VF hardware device
> 
> Config_1: virtio suspend/resume similar to what Lingshan has, largely vdpa stack
> Config_2: Device context method of admin commands

OK, that sounds good. The weird thing here is that you measure
"downtime". What exactly do you mean here?
I am guessing it's the time to retrieve the device state on the source and re-program it
on the destination? And this 3.71x is out of how long?

-- 
MST




^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:04                                                 ` Michael S. Tsirkin
@ 2023-11-17 12:11                                                   ` Parav Pandit
  2023-11-17 12:32                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 12:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 5:35 PM
> To: Parav Pandit <parav@nvidia.com>
> 
> On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 5:04 PM
> > >
> > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > >
> > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > > > > > >>
> > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > >>>> We should expose a limit of the device in the proposed
> > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> > > track.
> > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > >>>>
> > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > >>> I do worry about how this can even work though. If you
> > > > > > > >>> want a generic device you do not get to dictate how much
> > > > > > > >>> memory VM
> > > has.
> > > > > > > >>>
> > > > > > > >>> Aren't we talking bit per page? With 1TByte of memory to
> > > > > > > >>> track
> > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > >>>
> > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > >>> while at the same time fighting tooth and nail against
> > > > > > > >>> adding single bit status registers because scalability?
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> I have a feeling doing this completely theoretical like
> > > > > > > >>> this is
> > > problematic.
> > > > > > > >>> Maybe you have it all laid out neatly in your head but I
> > > > > > > >>> suspect not all of TC can picture it clearly enough
> > > > > > > >>> based just on spec
> > > text.
> > > > > > > >>>
> > > > > > > >>> We do sometimes ask for POC implementation in linux /
> > > > > > > >>> qemu to demonstrate how things work before merging code.
> > > > > > > >>> We skipped this for admin things so far but I think it's
> > > > > > > >>> a good idea to start doing it here.
> > > > > > > >>>
> > > > > > > >>> What makes me pause a bit before saying please do a PoC
> > > > > > > >>> is all the opposition that seems to exist to even using
> > > > > > > >>> admin commands in the 1st place. I think once we finally
> > > > > > > >>> stop arguing about whether to use admin commands at all
> > > > > > > >>> then a PoC will be needed
> > > > > before merging.
> > > > > > > >> We have POR productions that implemented the approach in
> > > > > > > >> my
> > > series.
> > > > > > > >> They are multiple generations of productions in market
> > > > > > > >> and running in customers data centers for years.
> > > > > > > >>
> > > > > > > >> Back to 2019 when we start working on vDPA, we have sent
> > > > > > > >> some samples of production(e.g., Cascade Glacier) and the
> > > > > > > >> datasheet, you can find live migration facilities there,
> > > > > > > >> includes suspend, vq state and other features.
> > > > > > > >>
> > > > > > > >> And there is an reference in DPDK live migration, I have
> > > > > > > >> provided this page
> > > > > > > >> before:
> > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it
> > > > > > > >> has been working for long long time.
> > > > > > > >>
> > > > > > > >> So if we let the facts speak, if we want to see if the
> > > > > > > >> proposal is proven to work, I would
> > > > > > > >> say: They are POR for years, customers already deployed
> > > > > > > >> them for
> > > years.
> > > > > > > > And I guess what you are trying to say is that this
> > > > > > > > patchset we are reviewing here should be help to the same
> > > > > > > > standard and there should be a PoC? Sounds reasonable.
> > > > > > > Yes and the in-marketing productions are POR, the series
> > > > > > > just improves the design, for example, our series also use
> > > > > > > registers to track vq state, but improvements than CG or
> > > > > > > BSC. So I think they are proven
> > > > > to work.
> > > > > >
> > > > > > If you prefer to go the route of POR and production and proven
> > > > > > documents
> > > > > etc, there is ton of it of multiple types of products I can dump
> > > > > here with open- source code and documentation and more.
> > > > > > Let me know what you would like to see.
> > > > > >
> > > > > > Michael has requested some performance comparisons, not all
> > > > > > are ready to
> > > > > share yet.
> > > > > > Some are present that I will share in coming weeks.
> > > > > >
> > > > > > And all the vdpa dpdk you published does not have basic CVQ
> > > > > > support when I
> > > > > last looked at it.
> > > > > > Do you know when was it added?
> > > > >
> > > > > It's good enough for PoC I think, CVQ or not.
> > > > > The problem with CVQ generally, is that VDPA wants to shadow CVQ
> > > > > it at all times because it wants to decode and cache the
> > > > > content. But this problem has nothing to do with dirty tracking
> > > > > even though it also
> > > mentions "shadow":
> > > > > if device can report it's state then there's no need to shadow CVQ.
> > > >
> > > > For the performance numbers with the pre-copy and device context
> > > > of
> > > patches posted 1 to 5, the downtime reduction of the VM is 3.71x
> > > with active traffic on 8 RQs at 100Gbps port speed.
> > >
> > > Sounds good can you please post a bit more detail?
> > > which configs are you comparing what was the result on each of them.
> >
> > Common config: 8+8 tx and rx queues.
> > Port speed: 100Gbps
> > QEMU 8.1
> > Libvirt 7.0
> > GVM: Centos 7.4
> > Device: virtio VF hardware device
> >
> > Config_1: virtio suspend/resume similar to what Lingshan has, largely
> > vdpa stack
> > Config_2: Device context method of admin commands
> 
> OK that sounds good. The weird thing here is that you measure "downtime".
> What exactly do you mean here?
> I am guessing it's the time to retrieve on source and re-program device state on
> destination? And this is 3.71x out of how long?
Yes. Downtime is the time during which the VM is not responding to or receiving packets; it includes the time to reprogram the device.
The 3.71x is a relative number for the purpose of this discussion.
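For reference, one crude way to measure that kind of downtime externally is to stream packets at the guest and record the largest receive gap around the switchover; a rough sketch (the port and reporting are arbitrary choices):

/* Rough sketch: report the largest gap between packets received on a UDP port. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_addr.s_addr = htonl(INADDR_ANY),
                                    .sin_port = htons(7777) };  /* arbitrary port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        char buf[64];
        struct timespec prev = { 0 }, now;
        double max_gap = 0.0;

        while (recv(fd, buf, sizeof(buf), 0) >= 0) {
                clock_gettime(CLOCK_MONOTONIC, &now);
                if (prev.tv_sec || prev.tv_nsec) {
                        double gap = (now.tv_sec - prev.tv_sec) +
                                     (now.tv_nsec - prev.tv_nsec) / 1e9;
                        if (gap > max_gap) {
                                max_gap = gap;
                                printf("max gap so far: %.3f s\n", max_gap);
                        }
                }
                prev = now;
        }
        close(fd);
        return 0;
}

A traffic generator on the peer keeps sending at a fixed interval; the largest gap seen across the migration approximates the downtime observed by the workload.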



^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 11:49                                           ` Michael S. Tsirkin
@ 2023-11-17 12:15                                             ` Parav Pandit
  2023-11-17 12:37                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 12:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

Hi Alex, Jason,

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 5:20 PM
> To: Parav Pandit <parav@nvidia.com>

> > > Allocating resources on outgoing migration is a very bad idea.
> > > It is common to migrate prcisely because you are out of resources.
> > > Incoming is a different story, less of a problem.
> > >
> > The resource allocated may not be on same system.
> > Also the resource allocated while the VM is running, so I don’t see a problem.
> 
> > Additionally, this is not what the Linux kernel maintainers of iommu subsystem
> told us either.
> > Let me know if you check with Alex W and Jason who build this interface.
> 
> VFIO guys have their own ideas, if they want to talk to virtio guys they can come
> here and do that.

Since one of the use cases would have accepted letting dirty tracking fail, I don't see a problem.
This is not the only command on the source side that can fail.
So I anticipate that QEMU, libvirt, or any vfio user would build the orchestration around the possible failure, because the UAPI is well defined.

When there is a hypervisor that must have zero failures on the source side, such a kernel + device can reserve everything upfront.

Are you saying QEMU has zero memory allocations on the source side for migration?
That would be interesting to know.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 11:32                                                                 ` Michael S. Tsirkin
@ 2023-11-17 12:22                                                                   ` Parav Pandit
  2023-11-17 12:40                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 12:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 5:03 PM
> To: Parav Pandit <parav@nvidia.com>

> > Somehow the claim of shadow vq is great without sharing any performance
> numbers is what I don't agree with.
> 
> It's upstream in QEMU. Test it yourself.
> 
We did, a few minutes back.
It results in a call trace:
vhost_vdpa_setup_vq_irq crashes on list corruption on net-next.

We are stopping any shadow vq tests on unstable stuff.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:11                                                   ` Parav Pandit
@ 2023-11-17 12:32                                                     ` Michael S. Tsirkin
  2023-11-17 13:03                                                       ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 12:32 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 5:35 PM
> > To: Parav Pandit <parav@nvidia.com>
> > 
> > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 5:04 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > >
> > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > > > > > > >>
> > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > > >>>> We should expose a limit of the device in the proposed
> > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> > > > track.
> > > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > > >>>>
> > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > >>> I do worry about how this can even work though. If you
> > > > > > > > >>> want a generic device you do not get to dictate how much
> > > > > > > > >>> memory VM
> > > > has.
> > > > > > > > >>>
> > > > > > > > >>> Aren't we talking bit per page? With 1TByte of memory to
> > > > > > > > >>> track
> > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > >>>
> > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > >>> while at the same time fighting tooth and nail against
> > > > > > > > >>> adding single bit status registers because scalability?
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> I have a feeling doing this completely theoretical like
> > > > > > > > >>> this is
> > > > problematic.
> > > > > > > > >>> Maybe you have it all laid out neatly in your head but I
> > > > > > > > >>> suspect not all of TC can picture it clearly enough
> > > > > > > > >>> based just on spec
> > > > text.
> > > > > > > > >>>
> > > > > > > > >>> We do sometimes ask for POC implementation in linux /
> > > > > > > > >>> qemu to demonstrate how things work before merging code.
> > > > > > > > >>> We skipped this for admin things so far but I think it's
> > > > > > > > >>> a good idea to start doing it here.
> > > > > > > > >>>
> > > > > > > > >>> What makes me pause a bit before saying please do a PoC
> > > > > > > > >>> is all the opposition that seems to exist to even using
> > > > > > > > >>> admin commands in the 1st place. I think once we finally
> > > > > > > > >>> stop arguing about whether to use admin commands at all
> > > > > > > > >>> then a PoC will be needed
> > > > > > before merging.
> > > > > > > > >> We have POR productions that implemented the approach in
> > > > > > > > >> my
> > > > series.
> > > > > > > > >> They are multiple generations of productions in market
> > > > > > > > >> and running in customers data centers for years.
> > > > > > > > >>
> > > > > > > > >> Back to 2019 when we start working on vDPA, we have sent
> > > > > > > > >> some samples of production(e.g., Cascade Glacier) and the
> > > > > > > > >> datasheet, you can find live migration facilities there,
> > > > > > > > >> includes suspend, vq state and other features.
> > > > > > > > >>
> > > > > > > > >> And there is an reference in DPDK live migration, I have
> > > > > > > > >> provided this page
> > > > > > > > >> before:
> > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html, it
> > > > > > > > >> has been working for long long time.
> > > > > > > > >>
> > > > > > > > >> So if we let the facts speak, if we want to see if the
> > > > > > > > >> proposal is proven to work, I would
> > > > > > > > >> say: They are POR for years, customers already deployed
> > > > > > > > >> them for
> > > > years.
> > > > > > > > > And I guess what you are trying to say is that this
> > > > > > > > > patchset we are reviewing here should be help to the same
> > > > > > > > > standard and there should be a PoC? Sounds reasonable.
> > > > > > > > Yes and the in-marketing productions are POR, the series
> > > > > > > > just improves the design, for example, our series also use
> > > > > > > > registers to track vq state, but improvements than CG or
> > > > > > > > BSC. So I think they are proven
> > > > > > to work.
> > > > > > >
> > > > > > > If you prefer to go the route of POR and production and proven
> > > > > > > documents
> > > > > > etc, there is ton of it of multiple types of products I can dump
> > > > > > here with open- source code and documentation and more.
> > > > > > > Let me know what you would like to see.
> > > > > > >
> > > > > > > Michael has requested some performance comparisons, not all
> > > > > > > are ready to
> > > > > > share yet.
> > > > > > > Some are present that I will share in coming weeks.
> > > > > > >
> > > > > > > And all the vdpa dpdk you published does not have basic CVQ
> > > > > > > support when I
> > > > > > last looked at it.
> > > > > > > Do you know when was it added?
> > > > > >
> > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > The problem with CVQ generally, is that VDPA wants to shadow CVQ
> > > > > > it at all times because it wants to decode and cache the
> > > > > > content. But this problem has nothing to do with dirty tracking
> > > > > > even though it also
> > > > mentions "shadow":
> > > > > > if device can report it's state then there's no need to shadow CVQ.
> > > > >
> > > > > For the performance numbers with the pre-copy and device context
> > > > > of
> > > > patches posted 1 to 5, the downtime reduction of the VM is 3.71x
> > > > with active traffic on 8 RQs at 100Gbps port speed.
> > > >
> > > > Sounds good can you please post a bit more detail?
> > > > which configs are you comparing what was the result on each of them.
> > >
> > > Common config: 8+8 tx and rx queues.
> > > Port speed: 100Gbps
> > > QEMU 8.1
> > > Libvirt 7.0
> > > GVM: Centos 7.4
> > > Device: virtio VF hardware device
> > >
> > > Config_1: virtio suspend/resume similar to what Lingshan has, largely
> > > vdpa stack
> > > Config_2: Device context method of admin commands
> > 
> > OK that sounds good. The weird thing here is that you measure "downtime".
> > What exactly do you mean here?
> > I am guessing it's the time to retrieve on source and re-program device state on
> > destination? And this is 3.71x out of how long?
> Yes. Downtime is the time during which the VM is not responding or receiving packets, which involves reprogramming the device.
> 3.71x is relative time for this discussion.

Oh interesting. So VM state movement including reprogramming the CPU is
dominated by reprogramming this single NIC, by a factor of almost 4?
Can we get some absolute numbers too, please?

-- 
MST
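
For reference on the "bit per page" estimate quoted above, a minimal sketch
of the bitmap-size arithmetic, assuming 4 KiB pages and one bit per tracked
page; the granularity and any per-VF duplication are assumptions here, not
something fixed by the proposal text:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t guest_mem    = 1ULL << 40;            /* 1 TiB of guest memory (example) */
        uint64_t page_size    = 4096;                  /* 4 KiB pages (assumed granularity) */
        uint64_t pages        = guest_mem / page_size; /* ~256M pages */
        uint64_t bitmap_bytes = pages / 8;             /* one bit per page */

        printf("pages tracked: %llu\n", (unsigned long long)pages);
        printf("bitmap size:   %llu MiB per tracked member device\n",
               (unsigned long long)(bitmap_bytes >> 20));
        return 0;
    }

Under these assumptions the bitmap is 32 MiB per TiB of guest memory per
member device; finer tracking granularity or extra per-VF copies scale that
figure linearly.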


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:15                                             ` Parav Pandit
@ 2023-11-17 12:37                                               ` Michael S. Tsirkin
  2023-11-17 12:49                                                 ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 12:37 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 12:15:54PM +0000, Parav Pandit wrote:
> Hi Alex, Jason,
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 5:20 PM
> > To: Parav Pandit <parav@nvidia.com>
> 
> > > > Allocating resources on outgoing migration is a very bad idea.
> > > > It is common to migrate prcisely because you are out of resources.
> > > > Incoming is a different story, less of a problem.
> > > >
> > > The resource allocated may not be on same system.
> > > Also the resource allocated while the VM is running, so I don’t see a problem.
> > 
> > > Additionally, this is not what the Linux kernel maintainers of iommu subsystem
> > told us either.
> > > Let me know if you check with Alex W and Jason who build this interface.
> > 
> > VFIO guys have their own ideas, if they want to talk to virtio guys they can come
> > here and do that.
> 
> Since one of the use cases would have accepted to let dirty tracking to fail, I dont see a problem.
> This is not the only command on source that fails.
> So I anticipate that QEMU and libvirt or any vfio user would build the orchestration around the possible failure because the UAPI is well defined.
>
> When there is hypervisor, that must have zero failures on src side, such kernel + device can build everything reserved upfront.
> 
> Do you say, QEMU has zero memory allocations on source side for migration?
> That would be interesting to know.

More or less, yes. More precisely, while in theory the allocations it does
can fail, in practice that happens rarely enough that QEMU does not even
bother checking and will immediately crash if they do. The reason is
that it uses virtual memory, so it scales to a huge number of VMs.
Migrating a single VM at a time is not even worth discussing.

-- 
MST
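
A minimal sketch of the virtual-memory point above (the sizes and the
cgroup limit are illustrative assumptions, not QEMU's actual allocation
pattern): a large allocation normally succeeds because only address space
is reserved; physical memory is consumed, and a memory cgroup limit can
bite, only when the pages are actually touched.

    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t size = 8ULL << 30;      /* 8 GiB, an assumed buffer size */

        /* With Linux overcommit this malloc() normally succeeds even on a
         * loaded host: only virtual address space is reserved here. */
        unsigned char *buf = malloc(size);
        if (!buf)
            return 1;                  /* rarely taken, per the discussion */

        /* Physical pages are allocated only as they are written; this is
         * where a memory.max cgroup limit or the OOM killer would hit,
         * not at the malloc() above. */
        memset(buf, 0, size);

        free(buf);
        return 0;
    }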


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:22                                                                   ` Parav Pandit
@ 2023-11-17 12:40                                                                     ` Michael S. Tsirkin
  2023-11-17 12:51                                                                       ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 12:40 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 12:22:59PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 5:03 PM
> > To: Parav Pandit <parav@nvidia.com>
> 
> > > Somehow the claim of shadow vq is great without sharing any performance
> > numbers is what I don't agree with.
> > 
> > It's upstream in QEMU. Test it youself.
> > 
> We did few minutes back.
> It results in a call trace.
> Vhost_vdpa_setup_vq_irq crashes on list corruption on net-next.

Wrong list for this bug report.

> We are stopping any shadow vq tests on unstable stuff.

If you don't want to benchmark against alternatives, how are you
going to prove your stuff is worth everyone's time?

-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:37                                               ` Michael S. Tsirkin
@ 2023-11-17 12:49                                                 ` Parav Pandit
  2023-11-17 13:58                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 12:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 6:07 PM
> To: Parav Pandit <parav@nvidia.com>
> 
> On Fri, Nov 17, 2023 at 12:15:54PM +0000, Parav Pandit wrote:
> > Hi Alex, Jason,
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 5:20 PM
> > > To: Parav Pandit <parav@nvidia.com>
> >
> > > > > Allocating resources on outgoing migration is a very bad idea.
> > > > > It is common to migrate prcisely because you are out of resources.
> > > > > Incoming is a different story, less of a problem.
> > > > >
> > > > The resource allocated may not be on same system.
> > > > Also the resource allocated while the VM is running, so I don’t see a
> problem.
> > >
> > > > Additionally, this is not what the Linux kernel maintainers of
> > > > iommu subsystem
> > > told us either.
> > > > Let me know if you check with Alex W and Jason who build this interface.
> > >
> > > VFIO guys have their own ideas, if they want to talk to virtio guys
> > > they can come here and do that.
> >
> > Since one of the use cases would have accepted to let dirty tracking to fail, I
> dont see a problem.
> > This is not the only command on source that fails.
> > So I anticipate that QEMU and libvirt or any vfio user would build the
> orchestration around the possible failure because the UAPI is well defined.
> >
> > When there is hypervisor, that must have zero failures on src side, such kernel
> + device can build everything reserved upfront.
> >
> > Do you say, QEMU has zero memory allocations on source side for migration?
> > That would be interesting to know.
> 
> More or less yes. More precisely while in theory allocations it's doing can fail in
> practice it happens rarely enough that QEMU does not even bother checking
> and will immediately crash if they do. The reason is that it's using virtual
> memory, so it scales to a huge number of VMs.
> Migrating a single VM at a time is not even worth discussing.

Wow, crashing the running VM is even worse than failing the migration.
I have live migrated VMs one by one and have seen customers migrate on hyperconverged systems. Of course, it was not QEMU.
Single VM migration is real and used by cloud operators.
Why would you ignore it?

^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:40                                                                     ` Michael S. Tsirkin
@ 2023-11-17 12:51                                                                       ` Parav Pandit
  2023-11-21  5:16                                                                         ` Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 12:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Friday, November 17, 2023 6:11 PM
> 
> On Fri, Nov 17, 2023 at 12:22:59PM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 5:03 PM
> > > To: Parav Pandit <parav@nvidia.com>
> >
> > > > Somehow the claim of shadow vq is great without sharing any
> > > > performance
> > > numbers is what I don't agree with.
> > >
> > > It's upstream in QEMU. Test it youself.
> > >
> > We did few minutes back.
> > It results in a call trace.
> > Vhost_vdpa_setup_vq_irq crashes on list corruption on net-next.
> 
> Wrong list for this bug report.
> 
> > We are stopping any shadow vq tests on unstable stuff.
> 
> If you don't want to benchmark against alternatives how are you going to prove
> your stuff is worth everyone's time?

Comparing the performance of functional things is what counts.
You suggest shadow vq; frankly, you should post the shadow vq numbers.

It is really not my role to report bugs in unstable code and benchmark
the performance against it.

We proposed device context and provided the numbers you asked for. We will
mostly not be able to go further than this.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:32                                                     ` Michael S. Tsirkin
@ 2023-11-17 13:03                                                       ` Parav Pandit
  2023-11-17 14:00                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 13:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 6:02 PM
> 
> On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 5:35 PM
> > > To: Parav Pandit <parav@nvidia.com>
> > >
> > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > >
> > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan
> wrote:
> > > > > > > > > >>
> > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit
> wrote:
> > > > > > > > > >>>> We should expose a limit of the device in the
> > > > > > > > > >>>> proposed
> > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it
> > > > > > > > > can
> > > > > track.
> > > > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > > > >>>>
> > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > >>> I do worry about how this can even work though. If
> > > > > > > > > >>> you want a generic device you do not get to dictate
> > > > > > > > > >>> how much memory VM
> > > > > has.
> > > > > > > > > >>>
> > > > > > > > > >>> Aren't we talking bit per page? With 1TByte of
> > > > > > > > > >>> memory to track
> > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > >>>
> > > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > > >>> while at the same time fighting tooth and nail
> > > > > > > > > >>> against adding single bit status registers because scalability?
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>> I have a feeling doing this completely theoretical
> > > > > > > > > >>> like this is
> > > > > problematic.
> > > > > > > > > >>> Maybe you have it all laid out neatly in your head
> > > > > > > > > >>> but I suspect not all of TC can picture it clearly
> > > > > > > > > >>> enough based just on spec
> > > > > text.
> > > > > > > > > >>>
> > > > > > > > > >>> We do sometimes ask for POC implementation in linux
> > > > > > > > > >>> / qemu to demonstrate how things work before merging
> code.
> > > > > > > > > >>> We skipped this for admin things so far but I think
> > > > > > > > > >>> it's a good idea to start doing it here.
> > > > > > > > > >>>
> > > > > > > > > >>> What makes me pause a bit before saying please do a
> > > > > > > > > >>> PoC is all the opposition that seems to exist to
> > > > > > > > > >>> even using admin commands in the 1st place. I think
> > > > > > > > > >>> once we finally stop arguing about whether to use
> > > > > > > > > >>> admin commands at all then a PoC will be needed
> > > > > > > before merging.
> > > > > > > > > >> We have POR productions that implemented the approach
> > > > > > > > > >> in my
> > > > > series.
> > > > > > > > > >> They are multiple generations of productions in
> > > > > > > > > >> market and running in customers data centers for years.
> > > > > > > > > >>
> > > > > > > > > >> Back to 2019 when we start working on vDPA, we have
> > > > > > > > > >> sent some samples of production(e.g., Cascade
> > > > > > > > > >> Glacier) and the datasheet, you can find live
> > > > > > > > > >> migration facilities there, includes suspend, vq state and other
> features.
> > > > > > > > > >>
> > > > > > > > > >> And there is an reference in DPDK live migration, I
> > > > > > > > > >> have provided this page
> > > > > > > > > >> before:
> > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html,
> > > > > > > > > >> it has been working for long long time.
> > > > > > > > > >>
> > > > > > > > > >> So if we let the facts speak, if we want to see if
> > > > > > > > > >> the proposal is proven to work, I would
> > > > > > > > > >> say: They are POR for years, customers already
> > > > > > > > > >> deployed them for
> > > > > years.
> > > > > > > > > > And I guess what you are trying to say is that this
> > > > > > > > > > patchset we are reviewing here should be help to the
> > > > > > > > > > same standard and there should be a PoC? Sounds reasonable.
> > > > > > > > > Yes and the in-marketing productions are POR, the series
> > > > > > > > > just improves the design, for example, our series also
> > > > > > > > > use registers to track vq state, but improvements than
> > > > > > > > > CG or BSC. So I think they are proven
> > > > > > > to work.
> > > > > > > >
> > > > > > > > If you prefer to go the route of POR and production and
> > > > > > > > proven documents
> > > > > > > etc, there is ton of it of multiple types of products I can
> > > > > > > dump here with open- source code and documentation and more.
> > > > > > > > Let me know what you would like to see.
> > > > > > > >
> > > > > > > > Michael has requested some performance comparisons, not
> > > > > > > > all are ready to
> > > > > > > share yet.
> > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > >
> > > > > > > > And all the vdpa dpdk you published does not have basic
> > > > > > > > CVQ support when I
> > > > > > > last looked at it.
> > > > > > > > Do you know when was it added?
> > > > > > >
> > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > The problem with CVQ generally, is that VDPA wants to shadow
> > > > > > > CVQ it at all times because it wants to decode and cache the
> > > > > > > content. But this problem has nothing to do with dirty
> > > > > > > tracking even though it also
> > > > > mentions "shadow":
> > > > > > > if device can report it's state then there's no need to shadow CVQ.
> > > > > >
> > > > > > For the performance numbers with the pre-copy and device
> > > > > > context of
> > > > > patches posted 1 to 5, the downtime reduction of the VM is 3.71x
> > > > > with active traffic on 8 RQs at 100Gbps port speed.
> > > > >
> > > > > Sounds good can you please post a bit more detail?
> > > > > which configs are you comparing what was the result on each of them.
> > > >
> > > > Common config: 8+8 tx and rx queues.
> > > > Port speed: 100Gbps
> > > > QEMU 8.1
> > > > Libvirt 7.0
> > > > GVM: Centos 7.4
> > > > Device: virtio VF hardware device
> > > >
> > > > Config_1: virtio suspend/resume similar to what Lingshan has,
> > > > largely vdpa stack
> > > > Config_2: Device context method of admin commands
> > >
> > > OK that sounds good. The weird thing here is that you measure "downtime".
> > > What exactly do you mean here?
> > > I am guessing it's the time to retrieve on source and re-program
> > > device state on destination? And this is 3.71x out of how long?
> > Yes. Downtime is the time during which the VM is not responding or receiving
> packets, which involves reprogramming the device.
> > 3.71x is relative time for this discussion.
> 
> Oh interesting. So VM state movement including reprogramming the CPU is
> dominated by reprogramming this single NIC, by a factor of almost 4?
Yes.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:49                                                 ` Parav Pandit
@ 2023-11-17 13:58                                                   ` Michael S. Tsirkin
  2023-11-17 14:49                                                     ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 13:58 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 12:49:36PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 6:07 PM
> > To: Parav Pandit <parav@nvidia.com>
> > 
> > On Fri, Nov 17, 2023 at 12:15:54PM +0000, Parav Pandit wrote:
> > > Hi Alex, Jason,
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 5:20 PM
> > > > To: Parav Pandit <parav@nvidia.com>
> > >
> > > > > > Allocating resources on outgoing migration is a very bad idea.
> > > > > > It is common to migrate prcisely because you are out of resources.
> > > > > > Incoming is a different story, less of a problem.
> > > > > >
> > > > > The resource allocated may not be on same system.
> > > > > Also the resource allocated while the VM is running, so I don’t see a
> > problem.
> > > >
> > > > > Additionally, this is not what the Linux kernel maintainers of
> > > > > iommu subsystem
> > > > told us either.
> > > > > Let me know if you check with Alex W and Jason who build this interface.
> > > >
> > > > VFIO guys have their own ideas, if they want to talk to virtio guys
> > > > they can come here and do that.
> > >
> > > Since one of the use cases would have accepted to let dirty tracking to fail, I
> > dont see a problem.
> > > This is not the only command on source that fails.
> > > So I anticipate that QEMU and libvirt or any vfio user would build the
> > orchestration around the possible failure because the UAPI is well defined.
> > >
> > > When there is hypervisor, that must have zero failures on src side, such kernel
> > + device can build everything reserved upfront.
> > >
> > > Do you say, QEMU has zero memory allocations on source side for migration?
> > > That would be interesting to know.
> > 
> > More or less yes. More precisely while in theory allocations it's doing can fail in
> > practice it happens rarely enough that QEMU does not even bother checking
> > and will immediately crash if they do. The reason is that it's using virtual
> > memory, so it scales to a huge number of VMs.
> > Migrating a single VM at a time is not even worth discussing.
> 
> Wow that is even much worse to crash the running VM, instead of failing the migration.

I am telling you: in practice it does not crash, because it does not
allocate physical memory.

> I have live migrated VMs one by one and have seen customers migrate on hyperconverged systems. Ofcourse it was not QEMU.
> Single VM migration is real and used by cloud operators.
> Why would you ignore it?

I guess they only have single-tenant hosts then? Because each tenant
might request migration at any time.
-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 13:03                                                       ` Parav Pandit
@ 2023-11-17 14:00                                                         ` Michael S. Tsirkin
  2023-11-17 14:48                                                           ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 14:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 6:02 PM
> > 
> > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > To: Parav Pandit <parav@nvidia.com>
> > > >
> > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > >
> > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan
> > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > >>>> We should expose a limit of the device in the
> > > > > > > > > > >>>> proposed
> > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it
> > > > > > > > > > can
> > > > > > track.
> > > > > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > > >>> I do worry about how this can even work though. If
> > > > > > > > > > >>> you want a generic device you do not get to dictate
> > > > > > > > > > >>> how much memory VM
> > > > > > has.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Aren't we talking bit per page? With 1TByte of
> > > > > > > > > > >>> memory to track
> > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > >>>
> > > > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > > > >>> while at the same time fighting tooth and nail
> > > > > > > > > > >>> against adding single bit status registers because scalability?
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>> I have a feeling doing this completely theoretical
> > > > > > > > > > >>> like this is
> > > > > > problematic.
> > > > > > > > > > >>> Maybe you have it all laid out neatly in your head
> > > > > > > > > > >>> but I suspect not all of TC can picture it clearly
> > > > > > > > > > >>> enough based just on spec
> > > > > > text.
> > > > > > > > > > >>>
> > > > > > > > > > >>> We do sometimes ask for POC implementation in linux
> > > > > > > > > > >>> / qemu to demonstrate how things work before merging
> > code.
> > > > > > > > > > >>> We skipped this for admin things so far but I think
> > > > > > > > > > >>> it's a good idea to start doing it here.
> > > > > > > > > > >>>
> > > > > > > > > > >>> What makes me pause a bit before saying please do a
> > > > > > > > > > >>> PoC is all the opposition that seems to exist to
> > > > > > > > > > >>> even using admin commands in the 1st place. I think
> > > > > > > > > > >>> once we finally stop arguing about whether to use
> > > > > > > > > > >>> admin commands at all then a PoC will be needed
> > > > > > > > before merging.
> > > > > > > > > > >> We have POR productions that implemented the approach
> > > > > > > > > > >> in my
> > > > > > series.
> > > > > > > > > > >> They are multiple generations of productions in
> > > > > > > > > > >> market and running in customers data centers for years.
> > > > > > > > > > >>
> > > > > > > > > > >> Back to 2019 when we start working on vDPA, we have
> > > > > > > > > > >> sent some samples of production(e.g., Cascade
> > > > > > > > > > >> Glacier) and the datasheet, you can find live
> > > > > > > > > > >> migration facilities there, includes suspend, vq state and other
> > features.
> > > > > > > > > > >>
> > > > > > > > > > >> And there is an reference in DPDK live migration, I
> > > > > > > > > > >> have provided this page
> > > > > > > > > > >> before:
> > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html,
> > > > > > > > > > >> it has been working for long long time.
> > > > > > > > > > >>
> > > > > > > > > > >> So if we let the facts speak, if we want to see if
> > > > > > > > > > >> the proposal is proven to work, I would
> > > > > > > > > > >> say: They are POR for years, customers already
> > > > > > > > > > >> deployed them for
> > > > > > years.
> > > > > > > > > > > And I guess what you are trying to say is that this
> > > > > > > > > > > patchset we are reviewing here should be help to the
> > > > > > > > > > > same standard and there should be a PoC? Sounds reasonable.
> > > > > > > > > > Yes and the in-marketing productions are POR, the series
> > > > > > > > > > just improves the design, for example, our series also
> > > > > > > > > > use registers to track vq state, but improvements than
> > > > > > > > > > CG or BSC. So I think they are proven
> > > > > > > > to work.
> > > > > > > > >
> > > > > > > > > If you prefer to go the route of POR and production and
> > > > > > > > > proven documents
> > > > > > > > etc, there is ton of it of multiple types of products I can
> > > > > > > > dump here with open- source code and documentation and more.
> > > > > > > > > Let me know what you would like to see.
> > > > > > > > >
> > > > > > > > > Michael has requested some performance comparisons, not
> > > > > > > > > all are ready to
> > > > > > > > share yet.
> > > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > > >
> > > > > > > > > And all the vdpa dpdk you published does not have basic
> > > > > > > > > CVQ support when I
> > > > > > > > last looked at it.
> > > > > > > > > Do you know when was it added?
> > > > > > > >
> > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > The problem with CVQ generally, is that VDPA wants to shadow
> > > > > > > > CVQ it at all times because it wants to decode and cache the
> > > > > > > > content. But this problem has nothing to do with dirty
> > > > > > > > tracking even though it also
> > > > > > mentions "shadow":
> > > > > > > > if device can report it's state then there's no need to shadow CVQ.
> > > > > > >
> > > > > > > For the performance numbers with the pre-copy and device
> > > > > > > context of
> > > > > > patches posted 1 to 5, the downtime reduction of the VM is 3.71x
> > > > > > with active traffic on 8 RQs at 100Gbps port speed.
> > > > > >
> > > > > > Sounds good can you please post a bit more detail?
> > > > > > which configs are you comparing what was the result on each of them.
> > > > >
> > > > > Common config: 8+8 tx and rx queues.
> > > > > Port speed: 100Gbps
> > > > > QEMU 8.1
> > > > > Libvirt 7.0
> > > > > GVM: Centos 7.4
> > > > > Device: virtio VF hardware device
> > > > >
> > > > > Config_1: virtio suspend/resume similar to what Lingshan has,
> > > > > largely vdpa stack
> > > > > Config_2: Device context method of admin commands
> > > >
> > > > OK that sounds good. The weird thing here is that you measure "downtime".
> > > > What exactly do you mean here?
> > > > I am guessing it's the time to retrieve on source and re-program
> > > > device state on destination? And this is 3.71x out of how long?
> > > Yes. Downtime is the time during which the VM is not responding or receiving
> > packets, which involves reprogramming the device.
> > > 3.71x is relative time for this discussion.
> > 
> > Oh interesting. So VM state movement including reprogramming the CPU is
> > dominated by reprogramming this single NIC, by a factor of almost 4?
> Yes.

Could you post some numbers too then? I want to know whether that would
imply that VM boot is slowed down significantly too. If yes, that's
another motivation for PCI transport 2.0.

-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 14:00                                                         ` Michael S. Tsirkin
@ 2023-11-17 14:48                                                           ` Parav Pandit
  2023-11-17 14:59                                                             ` Michael S. Tsirkin
  2023-11-21  6:55                                                             ` Jason Wang
  0 siblings, 2 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 14:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 7:31 PM
> To: Parav Pandit <parav@nvidia.com>
> 
> On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 6:02 PM
> > >
> > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > To: Parav Pandit <parav@nvidia.com>
> > > > >
> > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > >
> > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > >
> > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
> > > > > > > > > > > > Lingshan
> > > wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
> > > > > > > > > > > >>> Pandit
> > > wrote:
> > > > > > > > > > > >>>> We should expose a limit of the device in the
> > > > > > > > > > > >>>> proposed
> > > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much
> range
> > > > > > > > > > > it can
> > > > > > > track.
> > > > > > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > > > >>> I do worry about how this can even work though.
> > > > > > > > > > > >>> If you want a generic device you do not get to
> > > > > > > > > > > >>> dictate how much memory VM
> > > > > > > has.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Aren't we talking bit per page? With 1TByte of
> > > > > > > > > > > >>> memory to track
> > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > > > > >>> while at the same time fighting tooth and nail
> > > > > > > > > > > >>> against adding single bit status registers because
> scalability?
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> I have a feeling doing this completely
> > > > > > > > > > > >>> theoretical like this is
> > > > > > > problematic.
> > > > > > > > > > > >>> Maybe you have it all laid out neatly in your
> > > > > > > > > > > >>> head but I suspect not all of TC can picture it
> > > > > > > > > > > >>> clearly enough based just on spec
> > > > > > > text.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> We do sometimes ask for POC implementation in
> > > > > > > > > > > >>> linux / qemu to demonstrate how things work
> > > > > > > > > > > >>> before merging
> > > code.
> > > > > > > > > > > >>> We skipped this for admin things so far but I
> > > > > > > > > > > >>> think it's a good idea to start doing it here.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> What makes me pause a bit before saying please
> > > > > > > > > > > >>> do a PoC is all the opposition that seems to
> > > > > > > > > > > >>> exist to even using admin commands in the 1st
> > > > > > > > > > > >>> place. I think once we finally stop arguing
> > > > > > > > > > > >>> about whether to use admin commands at all then
> > > > > > > > > > > >>> a PoC will be needed
> > > > > > > > > before merging.
> > > > > > > > > > > >> We have POR productions that implemented the
> > > > > > > > > > > >> approach in my
> > > > > > > series.
> > > > > > > > > > > >> They are multiple generations of productions in
> > > > > > > > > > > >> market and running in customers data centers for years.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Back to 2019 when we start working on vDPA, we
> > > > > > > > > > > >> have sent some samples of production(e.g.,
> > > > > > > > > > > >> Cascade
> > > > > > > > > > > >> Glacier) and the datasheet, you can find live
> > > > > > > > > > > >> migration facilities there, includes suspend, vq
> > > > > > > > > > > >> state and other
> > > features.
> > > > > > > > > > > >>
> > > > > > > > > > > >> And there is an reference in DPDK live migration,
> > > > > > > > > > > >> I have provided this page
> > > > > > > > > > > >> before:
> > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.ht
> > > > > > > > > > > >> ml, it has been working for long long time.
> > > > > > > > > > > >>
> > > > > > > > > > > >> So if we let the facts speak, if we want to see
> > > > > > > > > > > >> if the proposal is proven to work, I would
> > > > > > > > > > > >> say: They are POR for years, customers already
> > > > > > > > > > > >> deployed them for
> > > > > > > years.
> > > > > > > > > > > > And I guess what you are trying to say is that
> > > > > > > > > > > > this patchset we are reviewing here should be help
> > > > > > > > > > > > to the same standard and there should be a PoC? Sounds
> reasonable.
> > > > > > > > > > > Yes and the in-marketing productions are POR, the
> > > > > > > > > > > series just improves the design, for example, our
> > > > > > > > > > > series also use registers to track vq state, but
> > > > > > > > > > > improvements than CG or BSC. So I think they are
> > > > > > > > > > > proven
> > > > > > > > > to work.
> > > > > > > > > >
> > > > > > > > > > If you prefer to go the route of POR and production
> > > > > > > > > > and proven documents
> > > > > > > > > etc, there is ton of it of multiple types of products I
> > > > > > > > > can dump here with open- source code and documentation and
> more.
> > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > >
> > > > > > > > > > Michael has requested some performance comparisons,
> > > > > > > > > > not all are ready to
> > > > > > > > > share yet.
> > > > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > > > >
> > > > > > > > > > And all the vdpa dpdk you published does not have
> > > > > > > > > > basic CVQ support when I
> > > > > > > > > last looked at it.
> > > > > > > > > > Do you know when was it added?
> > > > > > > > >
> > > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > > The problem with CVQ generally, is that VDPA wants to
> > > > > > > > > shadow CVQ it at all times because it wants to decode
> > > > > > > > > and cache the content. But this problem has nothing to
> > > > > > > > > do with dirty tracking even though it also
> > > > > > > mentions "shadow":
> > > > > > > > > if device can report it's state then there's no need to shadow
> CVQ.
> > > > > > > >
> > > > > > > > For the performance numbers with the pre-copy and device
> > > > > > > > context of
> > > > > > > patches posted 1 to 5, the downtime reduction of the VM is
> > > > > > > 3.71x with active traffic on 8 RQs at 100Gbps port speed.
> > > > > > >
> > > > > > > Sounds good can you please post a bit more detail?
> > > > > > > which configs are you comparing what was the result on each of
> them.
> > > > > >
> > > > > > Common config: 8+8 tx and rx queues.
> > > > > > Port speed: 100Gbps
> > > > > > QEMU 8.1
> > > > > > Libvirt 7.0
> > > > > > GVM: Centos 7.4
> > > > > > Device: virtio VF hardware device
> > > > > >
> > > > > > Config_1: virtio suspend/resume similar to what Lingshan has,
> > > > > > largely vdpa stack
> > > > > > Config_2: Device context method of admin commands
> > > > >
> > > > > OK that sounds good. The weird thing here is that you measure
> "downtime".
> > > > > What exactly do you mean here?
> > > > > I am guessing it's the time to retrieve on source and re-program
> > > > > device state on destination? And this is 3.71x out of how long?
> > > > Yes. Downtime is the time during which the VM is not responding or
> > > > receiving
> > > packets, which involves reprogramming the device.
> > > > 3.71x is relative time for this discussion.
> > >
> > > Oh interesting. So VM state movement including reprogramming the CPU
> > > is dominated by reprogramming this single NIC, by a factor of almost 4?
> > Yes.
> 
> Could you post some numbers too then?  I want to know whether that would
> imply that VM boot is slowed down significantly too. If yes that's another
> motivation for pci transport 2.0.
It was 1.8 sec down to 480 msec.
The time did not come from the PCI side or the boot side.

For the PCI side of things, you would want to compare PCI vs. non-PCI device based VM boot time.
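
As a sanity check, those two absolute numbers line up with the relative
figure quoted earlier, assuming 1.8 sec is the Config_1 (suspend/resume)
downtime and 480 msec is the Config_2 (device context) downtime:

    1.8 s / 0.48 s ~= 3.75, i.e. roughly the 3.71x downtime reduction reported.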

^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 13:58                                                   ` Michael S. Tsirkin
@ 2023-11-17 14:49                                                     ` Parav Pandit
  2023-11-17 15:00                                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-17 14:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 7:28 PM
> 
> On Fri, Nov 17, 2023 at 12:49:36PM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 6:07 PM
> > > To: Parav Pandit <parav@nvidia.com>
> > >
> > > On Fri, Nov 17, 2023 at 12:15:54PM +0000, Parav Pandit wrote:
> > > > Hi Alex, Jason,
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 5:20 PM
> > > > > To: Parav Pandit <parav@nvidia.com>
> > > >
> > > > > > > Allocating resources on outgoing migration is a very bad idea.
> > > > > > > It is common to migrate prcisely because you are out of resources.
> > > > > > > Incoming is a different story, less of a problem.
> > > > > > >
> > > > > > The resource allocated may not be on same system.
> > > > > > Also the resource allocated while the VM is running, so I
> > > > > > don’t see a
> > > problem.
> > > > >
> > > > > > Additionally, this is not what the Linux kernel maintainers of
> > > > > > iommu subsystem
> > > > > told us either.
> > > > > > Let me know if you check with Alex W and Jason who build this
> interface.
> > > > >
> > > > > VFIO guys have their own ideas, if they want to talk to virtio
> > > > > guys they can come here and do that.
> > > >
> > > > Since one of the use cases would have accepted to let dirty
> > > > tracking to fail, I
> > > dont see a problem.
> > > > This is not the only command on source that fails.
> > > > So I anticipate that QEMU and libvirt or any vfio user would build
> > > > the
> > > orchestration around the possible failure because the UAPI is well defined.
> > > >
> > > > When there is hypervisor, that must have zero failures on src
> > > > side, such kernel
> > > + device can build everything reserved upfront.
> > > >
> > > > Do you say, QEMU has zero memory allocations on source side for
> migration?
> > > > That would be interesting to know.
> > >
> > > More or less yes. More precisely while in theory allocations it's
> > > doing can fail in practice it happens rarely enough that QEMU does
> > > not even bother checking and will immediately crash if they do. The
> > > reason is that it's using virtual memory, so it scales to a huge number of
> VMs.
> > > Migrating a single VM at a time is not even worth discussing.
> >
> > Wow that is even much worse to crash the running VM, instead of failing the
> migration.
> 
> I am telling you.  It does not practically crash because it does not allocate
> physical memory.
> 
Once you run up against the memory cgroup limit, it can fail.

> > I have live migrated VMs one by one and have seen customers migrate on
> hyperconverged systems. Ofcourse it was not QEMU.
> > Single VM migration is real and used by cloud operators.
> > Why would you ignore it?
> 
> I guess they only have single tenant hosts then? Because each tenant might
> request migration at any time.
Typically, tenants do not initiate the migration; the cloud operator does.
And there is some agent in between to throttle the number of VMs migrating. It also needs to carve out some bandwidth on the physical network, and more.


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 14:48                                                           ` Parav Pandit
@ 2023-11-17 14:59                                                             ` Michael S. Tsirkin
  2023-11-21  6:55                                                             ` Jason Wang
  1 sibling, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 14:59 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 02:48:37PM +0000, Parav Pandit wrote:
> > Could you post some numbers too then?  I want to know whether that would
> > imply that VM boot is slowed down significantly too. If yes that's another
> > motivation for pci transport 2.0.
> It was 1.8 sec down to 480msec.
> The time didn't come from pci side or boot side.
> 
> For pci side of things you would want to compare the pci vs non pci device based VM boot time.

I mean setup currently requires the same config as VM load, so it will
take about 1 second? That's very bad.
-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 14:49                                                     ` Parav Pandit
@ 2023-11-17 15:00                                                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 15:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 02:49:03PM +0000, Parav Pandit wrote:
> > I guess they only have single tenant hosts then? Because each tenant might
> > request migration at any time.
> Typically, tenant do not initiate the migration. The cloud operator initiates the migration.

Can be at tenant's request.

> And there is some agent in-between to throttle the number of VMs migrating. It also needs to carve some bw on physical network and more.
> 

You are presumably paying for that.

-- 
MST


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  6:49                                       ` Michael S. Tsirkin
@ 2023-11-21  4:21                                         ` Jason Wang
  2023-11-21 16:24                                           ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-21  4:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 2:49 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Nov 16, 2023 at 12:24:27PM +0800, Jason Wang wrote:
> > On Thu, Nov 16, 2023 at 1:37 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, November 13, 2023 9:11 AM
> > > >
> > > > On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > > Hi Michael,
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Thursday, November 9, 2023 1:29 PM
> > > > >
> > > > > [..]
> > > > > > > Besides the issue of performance, it's also racy, assuming we are
> > > > > > > logging
> > > > > > IOVA.
> > > > > > >
> > > > > > > 0) device log IOVA
> > > > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > > > 2) guest map IOVA to a new GPA
> > > > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > > > >
> > > > > > > Then we lost the old GPA.
> > > > > >
> > > > > > Interesting and a good point. And by the way e.g. vhost has the same
> > > > > > issue.  You need to flush dirty tracking info when changing the
> > > > > > mappings somehow.  Parav what's the plan for this? Should be addressed in
> > > > the spec too.
> > > > > >
> > > > > As you listed the flush is needed for vhost or device-based DPT.
> > > >
> > > > What does DPT mean? Device Page Table? Let's not invent terminology which is
> > > > not known by others please.
> > > >
> > > Sorry for using the acronym. I meant dirty page tracking.
> > >
> > > > We have discussed it many times. You can't just depend on ATS or reinventing
> > > > wheels in virtio.
> > > The dependency is on the iommu which would have the mapping of GIOVA to GPA like any sw implementation.
> > > No dependency on ATS.
> > >
> > > >
> > > > What's more, please try not to give me the impression that the proposal is
> > > > optimized for a specific vendor (like device IOMMU stuffs).
> > > >
> > > You should stop calling this specific vendor thing.
> >
> > Well, as you have explained, the confusion came from "DPT" ...
> >
> > > One can equally say that suspend bit proposal is for the sw_vendor device who is forcing virtio hw device to only implement ioqueues + PASID + non_unified interface for PF, VF, SIOVs + non_TDISP based devices.
> > >
> > > > > The necessary plumbing is already covered for this in the query (read and
> > > > clear) command of this v3 proposal.
> > > >
> > > > The issue is logging via IOVA ... I don't see how "read and clear" can help.
> > > >
> > > Read and clear helps that ensures that all the dirty pages are reported, hence there is no mapping/unmapping race.
> >
> > Reported as IOVA ...
> >
> > > As everything is reported.
> > >
> > > > > It is listed in Device Write Records Read Command.
> > > >
> > > > Please explain how your proposal can solve the above race.
> > > >
> > > In below manner.
> > > 1. guest has GIOVA to GPA_1 mapping
> > > 2. RX packets occurred to GIOVA
> > > 3. device reported dirty page log for GIOVA (hypervisor is yet to read)
> > > 4. guest requested mapping change from GIOVA to GPA_2
> > > 4.1 During this IOTLB is invalidated and dirty page report is queried ensuring, it can change the mapping
> >
> > It requires
> >
> > 1) hypervisor traps IOTLB invalidation, which doesn't work when
> > nesting could be offloaded (IOMMUFD has started the work to support
> > nesting)
> > 2) query the device about the dirty page on each IOTLB invalidation which:
> > 2.1) A huge round trip: guest IOTLB invalidation -> trapped by
> > hypervisor -> start the query from the device -> device return ->
> > hypervisor reports IOTLB invalidation is done -> let guest run. Have
> > you benchmarked the RTT in this case? There are just too many places
> > that cause the delay in the middle.
>
> To be fair invalidations are already expensive e.g. with vhost iotlb
> it requires a slow system call.
> This will make them *even more* expensive.

Yes, a slow syscall plus a virtqueue query RTT.

We need some benchmarks. It looks to me that the invalidation is
currently done via a queue-based interface in VT-d, so the guest may
need to spin, which could trigger a lockup in the guest.
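
To make that round trip concrete, here is a minimal sketch of the
hypervisor-side path being discussed; every helper name in it is a
hypothetical placeholder, not a QEMU/KVM or kernel API:

/*
 * Minimal sketch (hypothetical helpers): a trapped guest IOTLB
 * invalidation first drains the device's write records over the admin
 * VQ, merges them into the migration dirty bitmap, and only then
 * completes the invalidation, so records logged against the old
 * IOVA->GPA mapping are not lost.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct write_record {
	uint64_t gpa;
	uint64_t len;
};

/* Stub: issue the read-and-clear write records admin command. */
static size_t admin_vq_read_and_clear_writes(struct write_record *recs,
					     size_t max)
{
	(void)max;
	recs[0] = (struct write_record){ .gpa = 0x100000, .len = 4096 };
	return 1;	/* pretend the device reported one record */
}

/* Stub: mark a GPA range dirty in the hypervisor's migration bitmap. */
static void migration_bitmap_set(uint64_t gpa, uint64_t len)
{
	printf("dirty: 0x%llx + 0x%llx\n",
	       (unsigned long long)gpa, (unsigned long long)len);
}

/* Stub: drop the old translation from the host IOMMU caches. */
static void host_iommu_invalidate(uint64_t iova, uint64_t len)
{
	(void)iova;
	(void)len;
}

static void handle_guest_iotlb_invalidation(uint64_t iova, uint64_t len)
{
	struct write_record recs[64];
	size_t n, i;

	/* Drain the device before the old mapping disappears. */
	do {
		n = admin_vq_read_and_clear_writes(recs, 64);
		for (i = 0; i < n; i++)
			migration_bitmap_set(recs[i].gpa, recs[i].len);
	} while (n == 64);	/* stop on a short batch */

	host_iommu_invalidate(iova, len);
	/* Only now is completion reported back to the guest. */
}

int main(void)
{
	handle_guest_iotlb_invalidation(0x200000, 4096);
	return 0;
}

Every trapped invalidation therefore carries at least one admin-queue
round trip to the owner device before the guest can make progress,
which is the RTT being questioned above.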

>
> Problem for some but not all workloads.  Again I agree motivation,
> tradeoffs and comparison with both dirty tracking by iommu and shadow vq
> approaches really should be included.

+1

>
>
> > 2.2) Guest triggerable behaviour, malicious guest can simply do
> > endless IOTLB invalidation to DOS the e.g admin virtqueue
>
> I'm not sure how much to worry about it - just don't allow more
> than one in flight per VM.

That's fine but it may need a note.

Thanks


>
>
>
> > >
> > > > >
> > > > > When the page write record is fully read, it is flushed.
> > > > > How/when to use, I think its hypervisor specific, so we probably better off not
> > > > documenting those details.
> > > >
> > > > Well, as the author of this proposal, at least you need to know how a hypervisor
> > > > can work with your proposal, no?
> > > >
> > > Likely yes, but it is not the scope of the spec to list those paths etc.
> >
> > Fine, but as a reviewer I need to know if it can work with a hypervisor well.
> >
> > >
> > > > > May be such read is needed in some other path too depending on how
> > > > hypervisor implemented.
> > > >
> > > > What do you mean by "May be ... some other path" here? You're inventing a
> > > > mechanism that you don't know how a hypervisor can use?
> > >
> > > No. I meant hypervisor may have more operations that map/unmap/flush where it may need to implement it.
> > > Some one may call it set_map(), some may say dma_map()...
> >
> > Ok.
> >
> > Thanks
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  5:51                               ` [virtio-comment] " Michael S. Tsirkin
  2023-11-16  7:35                                 ` Michael S. Tsirkin
  2023-11-16 10:28                                 ` Zhu, Lingshan
@ 2023-11-21  4:23                                 ` Jason Wang
  2 siblings, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-21  4:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 1:51 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > We should expose a limit of the device in the proposed WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > So that future provisioning framework can use it.
> >
> > I will cover this in v5 early next week.
>
> I do worry about how this can even work though. If you want a generic
> device you do not get to dictate how much memory VM has.
>
> Aren't we talking bit per page? With 1TByte of memory to track ->
> 256Gbit -> 32Gbit -> 8Gbyte per VF?
>
> And you happily say "we'll address this in the future" while at the same
> time fighting tooth and nail against adding single bit status registers
> because scalability?
>
>
> I have a feeling doing this completely theoretical like this is problematic.
> Maybe you have it all laid out neatly in your head but I suspect
> not all of TC can picture it clearly enough based just on spec text.
>
> We do sometimes ask for POC implementation in linux / qemu to
> demonstrate how things work before merging code. We skipped this
> for admin things so far but I think it's a good idea to start doing
> it here.

+1

Most virtio features were developed in this way. It's cheap and easy
to find issues in both the design and the implementation.

Another non-virtio example is the rocker switch, which was prototyped
in QEMU for switchdev.

>
> What makes me pause a bit before saying please do a PoC is
> all the opposition that seems to exist to even using admin
> commands in the 1st place. I think once we finally stop
> arguing about whether to use admin commands at all then
> a PoC will be needed before merging.

Exactly.

Thanks

>


>
> --
> MST
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17  3:02                                               ` [virtio-comment] " Parav Pandit
  2023-11-17  8:46                                                 ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-21  4:24                                                 ` Jason Wang
  2023-11-21 16:26                                                   ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-21  4:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 11:51 PM
> >
> > On Thu, Nov 16, 2023 at 05:29:49PM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 16, 2023 10:56 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 5:18 PM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > > > > >
> > > > > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin wrote:
> > > > > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > > > > We should expose a limit of the device in the proposed
> > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it can
> > > > track.
> > > > > > > > > > So that future provisioning framework can use it.
> > > > > > > > > >
> > > > > > > > > > I will cover this in v5 early next week.
> > > > > > > > >
> > > > > > > > > I do worry about how this can even work though. If you
> > > > > > > > > want a generic device you do not get to dictate how much memory
> > VM has.
> > > > > > > > >
> > > > > > > > > Aren't we talking bit per page? With 1TByte of memory to
> > > > > > > > > track
> > > > > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > >
> > > > > > > > Ugh. Actually of course:
> > > > > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit ->
> > > > > > > > 8Mbyte per VF
> > > > > > > >
> > > > > > > > 8Gbyte per *PF* with 1K VFs.
> > > > > > > >
> > > > > > > Device may not maintain as a bitmap.
> > > > > >
> > > > > > However you maintain it, there's 256Mega bit of information.
> > > > > There may be other data structures that device may deploy as for
> > > > > example
> > > > hash or tree or something else.
> > > >
> > > > Point being?
> > > The device may have some hashing accelerator or other improvements that
> > may perform better than bitmap as many queues in parallel attempt to update
> > the shared database.
> >
> > Maybe, I didn't give this thought.
> >
> > My point was that to be able to keep all combinations of dirty/non dirty page
> > for each 4k page in a 1TByte guest device needs 8MBytes of on-device memory
> > per VF. As designed the query also has to report it for each VF accurately even if
> > multiple VFs are accessing same guest.
> Yes.
>
> >
> > > >
> > > > > And this is runtime memory only during the short live migration
> > > > > period of
> > > > 400msec or less.
> > > > > It is not some _always_ resident memory.

When developing the spec, we should not make assumptions about the
implementation. For example, you can't just assume virtio is always
emulated in software in a DPU.

How can you make sure you can converge in 400ms without an interface
for the driver to set the correct parameters, like dirty rates?
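
As a rough back-of-envelope, assuming 4 KiB pages, a 100 Gbps
(~12.5 GB/s) migration link and a 400 ms downtime budget (these numbers
are illustrative, not taken from this thread):

\[
  \text{final dirty set} \le B \times T
  = 12.5\,\mathrm{GB/s} \times 0.4\,\mathrm{s}
  = 5\,\mathrm{GB} \approx 1.2 \times 10^{6}\ \text{pages},
\]

and pre-copy only converges at all if the sustained dirty rate stays
below B, i.e. below roughly 12.5 GB/s of newly written guest memory.
Without an interface to expose or bound the dirty rate, neither side
can tell whether that holds.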

Thanks

> > > >
> > > > No - write tracking is used in the live phase of migration. It can
> > > > be enabled as long as you wish - it's a question of policy.  There
> > > > actually exist solutions that utilize this phase for redundancy, permanently
> > running in this mode.
> > >
> > > If such use case exists, one may further improve the device implementation.
> >
> > Yes such use cases exist, there is no limit on how long migration takes.
> > So go ahead and further improve it please. Do not give us "we did not get
> > requests for this feature" please.
>
> Please describe the use case more precisely.
> If there is any application or OS API etc exists, please point to it where would you like to fit this dirty page tracking beyond device migration.
> We may have to draw a line to have reasonable point and not keep discussing infinitely.
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 12:51                                                                       ` Parav Pandit
@ 2023-11-21  5:16                                                                         ` Jason Wang
  2023-11-21 16:29                                                                           ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-21  5:16 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Fri, Nov 17, 2023 at 8:51 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Michael S. Tsirkin
> > Sent: Friday, November 17, 2023 6:11 PM
> >
> > On Fri, Nov 17, 2023 at 12:22:59PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 5:03 PM
> > > > To: Parav Pandit <parav@nvidia.com>
> > >
> > > > > Somehow the claim of shadow vq is great without sharing any
> > > > > performance
> > > > numbers is what I don't agree with.
> > > >
> > > > It's upstream in QEMU. Test it youself.
> > > >
> > > We did few minutes back.
> > > It results in a call trace.
> > > Vhost_vdpa_setup_vq_irq crashes on list corruption on net-next.
> >
> > Wrong list for this bug report.
> >
> > > We are stopping any shadow vq tests on unstable stuff.
> >
> > If you don't want to benchmark against alternatives how are you going to prove
> > your stuff is worth everyone's time?
>
> Comparing performance of the functional things count.
> You suggest shadow vq, frankly you should post the grand numbers of shadow vq.

We need an apples-to-apples comparison; otherwise the numbers can be disputed, no?

>
> It is really not my role to report bug of unstable stuff and compare the perf against.

QEMU/KVM is highly relevant here, no? And it's how this community
develops things. The shadow vq code is handy.

Just an email to the QEMU list should be fine; we're not asking you to fix the bug.

Btw, how do you define stable? E.g., do you think the Linus tree is stable?

Thanks

>
> We propose device context and provided the numbers you asked. Mostly wont be able to go farther than this.
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-17 14:48                                                           ` Parav Pandit
  2023-11-17 14:59                                                             ` Michael S. Tsirkin
@ 2023-11-21  6:55                                                             ` Jason Wang
  2023-11-21 16:30                                                               ` Parav Pandit
  2023-11-22  2:31                                                               ` Si-Wei Liu
  1 sibling, 2 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-21  6:55 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan, virtio-comment, cohuck,
	sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas, eperezma,
	Siwei Liu

On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 7:31 PM
> > To: Parav Pandit <parav@nvidia.com>
> >
> > On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 6:02 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
> > > > > > > > > > > > > Lingshan
> > > > wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
> > > > > > > > > > > > >>> Pandit
> > > > wrote:
> > > > > > > > > > > > >>>> We should expose a limit of the device in the
> > > > > > > > > > > > >>>> proposed
> > > > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much
> > range
> > > > > > > > > > > > it can
> > > > > > > > track.
> > > > > > > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > > > > >>> I do worry about how this can even work though.
> > > > > > > > > > > > >>> If you want a generic device you do not get to
> > > > > > > > > > > > >>> dictate how much memory VM
> > > > > > > > has.
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> Aren't we talking bit per page? With 1TByte of
> > > > > > > > > > > > >>> memory to track
> > > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > > > > > >>> while at the same time fighting tooth and nail
> > > > > > > > > > > > >>> against adding single bit status registers because
> > scalability?
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> I have a feeling doing this completely
> > > > > > > > > > > > >>> theoretical like this is
> > > > > > > > problematic.
> > > > > > > > > > > > >>> Maybe you have it all laid out neatly in your
> > > > > > > > > > > > >>> head but I suspect not all of TC can picture it
> > > > > > > > > > > > >>> clearly enough based just on spec
> > > > > > > > text.
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> We do sometimes ask for POC implementation in
> > > > > > > > > > > > >>> linux / qemu to demonstrate how things work
> > > > > > > > > > > > >>> before merging
> > > > code.
> > > > > > > > > > > > >>> We skipped this for admin things so far but I
> > > > > > > > > > > > >>> think it's a good idea to start doing it here.
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> What makes me pause a bit before saying please
> > > > > > > > > > > > >>> do a PoC is all the opposition that seems to
> > > > > > > > > > > > >>> exist to even using admin commands in the 1st
> > > > > > > > > > > > >>> place. I think once we finally stop arguing
> > > > > > > > > > > > >>> about whether to use admin commands at all then
> > > > > > > > > > > > >>> a PoC will be needed
> > > > > > > > > > before merging.
> > > > > > > > > > > > >> We have POR productions that implemented the
> > > > > > > > > > > > >> approach in my
> > > > > > > > series.
> > > > > > > > > > > > >> They are multiple generations of productions in
> > > > > > > > > > > > >> market and running in customers data centers for years.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Back to 2019 when we start working on vDPA, we
> > > > > > > > > > > > >> have sent some samples of production(e.g.,
> > > > > > > > > > > > >> Cascade
> > > > > > > > > > > > >> Glacier) and the datasheet, you can find live
> > > > > > > > > > > > >> migration facilities there, includes suspend, vq
> > > > > > > > > > > > >> state and other
> > > > features.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> And there is an reference in DPDK live migration,
> > > > > > > > > > > > >> I have provided this page
> > > > > > > > > > > > >> before:
> > > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.ht
> > > > > > > > > > > > >> ml, it has been working for long long time.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> So if we let the facts speak, if we want to see
> > > > > > > > > > > > >> if the proposal is proven to work, I would
> > > > > > > > > > > > >> say: They are POR for years, customers already
> > > > > > > > > > > > >> deployed them for
> > > > > > > > years.
> > > > > > > > > > > > > And I guess what you are trying to say is that
> > > > > > > > > > > > > this patchset we are reviewing here should be help
> > > > > > > > > > > > > to the same standard and there should be a PoC? Sounds
> > reasonable.
> > > > > > > > > > > > Yes and the in-marketing productions are POR, the
> > > > > > > > > > > > series just improves the design, for example, our
> > > > > > > > > > > > series also use registers to track vq state, but
> > > > > > > > > > > > improvements than CG or BSC. So I think they are
> > > > > > > > > > > > proven
> > > > > > > > > > to work.
> > > > > > > > > > >
> > > > > > > > > > > If you prefer to go the route of POR and production
> > > > > > > > > > > and proven documents
> > > > > > > > > > etc, there is ton of it of multiple types of products I
> > > > > > > > > > can dump here with open- source code and documentation and
> > more.
> > > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > > >
> > > > > > > > > > > Michael has requested some performance comparisons,
> > > > > > > > > > > not all are ready to
> > > > > > > > > > share yet.
> > > > > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > > > > >
> > > > > > > > > > > And all the vdpa dpdk you published does not have
> > > > > > > > > > > basic CVQ support when I
> > > > > > > > > > last looked at it.
> > > > > > > > > > > Do you know when was it added?
> > > > > > > > > >
> > > > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > > > The problem with CVQ generally, is that VDPA wants to
> > > > > > > > > > shadow CVQ it at all times because it wants to decode
> > > > > > > > > > and cache the content. But this problem has nothing to
> > > > > > > > > > do with dirty tracking even though it also
> > > > > > > > mentions "shadow":
> > > > > > > > > > if device can report it's state then there's no need to shadow
> > CVQ.
> > > > > > > > >
> > > > > > > > > For the performance numbers with the pre-copy and device
> > > > > > > > > context of
> > > > > > > > patches posted 1 to 5, the downtime reduction of the VM is
> > > > > > > > 3.71x with active traffic on 8 RQs at 100Gbps port speed.
> > > > > > > >
> > > > > > > > Sounds good can you please post a bit more detail?
> > > > > > > > which configs are you comparing what was the result on each of
> > them.
> > > > > > >
> > > > > > > Common config: 8+8 tx and rx queues.
> > > > > > > Port speed: 100Gbps
> > > > > > > QEMU 8.1
> > > > > > > Libvirt 7.0
> > > > > > > GVM: Centos 7.4
> > > > > > > Device: virtio VF hardware device
> > > > > > >
> > > > > > > Config_1: virtio suspend/resume similar to what Lingshan has,
> > > > > > > largely vdpa stack
> > > > > > > Config_2: Device context method of admin commands
> > > > > >
> > > > > > OK that sounds good. The weird thing here is that you measure
> > "downtime".
> > > > > > What exactly do you mean here?
> > > > > > I am guessing it's the time to retrieve on source and re-program
> > > > > > device state on destination? And this is 3.71x out of how long?
> > > > > Yes. Downtime is the time during which the VM is not responding or
> > > > > receiving
> > > > packets, which involves reprogramming the device.
> > > > > 3.71x is relative time for this discussion.
> > > >
> > > > Oh interesting. So VM state movement including reprogramming the CPU
> > > > is dominated by reprogramming this single NIC, by a factor of almost 4?
> > > Yes.
> >
> > Could you post some numbers too then?  I want to know whether that would
> > imply that VM boot is slowed down significantly too. If yes that's another
> > motivation for pci transport 2.0.
> It was 1.8 sec down to 480msec.

Well, there's ongoing work to reduce the downtime of the shadow virtqueue.

Eugenio or Si-wei may share an exact number, but it should be several
hundred ms.

It seems the shadow virtqueue itself is not the major factor, though;
rather it's the time spent programming vendor-specific mappings, for
example.

Thanks

> The time didn't come from pci side or boot side.
>
> For pci side of things you would want to compare the pci vs non pci device based VM boot time.
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-16  5:29                             ` [virtio-comment] " Parav Pandit
  2023-11-16  5:51                               ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-21  7:14                               ` Jason Wang
  2023-11-21 16:31                                 ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-21  7:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Thu, Nov 16, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, November 16, 2023 9:54 AM
> >
> > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, November 13, 2023 9:07 AM
> > > >
> > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Tuesday, November 7, 2023 9:34 AM
> > > > > >
> > > > > > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Monday, November 6, 2023 12:04 PM
> > > > > > > >
> > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > During a device migration flow (typically in a
> > > > > > > > > > > > > > > precopy phase of the live migration), a device
> > > > > > > > > > > > > > > may write to the guest memory. Some
> > > > > > > > > > > > > > > iommu/hypervisor may not be able to track
> > > > > > > > > > > > > > > these
> > > > > > > > > > written pages.
> > > > > > > > > > > > > > > These pages to be migrated from source to
> > > > > > > > > > > > > > > destination
> > > > > > hypervisor.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > A device which writes to these pages, provides
> > > > > > > > > > > > > > > the page address record of the to the owner device.
> > > > > > > > > > > > > > > The owner device starts write recording for
> > > > > > > > > > > > > > > the device and queries all the page addresses
> > > > > > > > > > > > > > > written by the
> > > > device.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Fixes:
> > > > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/issue
> > > > > > > > > > > > > > > s/17
> > > > > > > > > > > > > > > 6
> > > > > > > > > > > > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > > > > > > > > > > > Signed-off-by: Satananda Burla
> > > > > > > > > > > > > > > <sburla@marvell.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > changelog:
> > > > > > > > > > > > > > > v1->v2:
> > > > > > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  admin-cmds-device-migration.tex | 15
> > > > > > > > > > > > > > > +++++++++++++++
> > > > > > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > b/admin-cmds-device-migration.tex index
> > > > > > > > > > > > > > > ed911e4..2e32f2c
> > > > > > > > > > > > > > > 100644
> > > > > > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > > > > > > > Migration}\label{sec:Basic Facilities of a
> > > > > > > > > > > > > > > Virtio Device / The owner driver can discard
> > > > > > > > > > > > > > > any partially read or written device context
> > > > > > > > > > > > > > > when  any of the device migration flow
> > > > > > > > > > > > > > should be aborted.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +During the device migration flow, a
> > > > > > > > > > > > > > > +passthrough device may write data to the
> > > > > > > > > > > > > > > +guest virtual machine's memory, a source
> > > > > > > > > > > > > > > +hypervisor needs to keep track of these
> > > > > > > > > > > > > > > +written memory to migrate such memory to
> > > > > > > > > > > > > > > +destination
> > > > > > > > > > > > > > hypervisor.
> > > > > > > > > > > > > > > +Some systems may not be able to keep track of
> > > > > > > > > > > > > > > +such memory write addresses at hypervisor level.
> > > > > > > > > > > > > > > +In such a scenario, a device records and
> > > > > > > > > > > > > > > +reports these written memory addresses to the
> > > > > > > > > > > > > > > +owner device. The owner driver enables write
> > > > > > > > > > > > > > > +recording for one or more physical address
> > > > > > > > > > > > > > > +ranges per device during device
> > > > > > > > > > migration flow.
> > > > > > > > > > > > > > > +The owner driver periodically queries these
> > > > > > > > > > > > > > > +written physical address
> > > > > > > > > > > > records from the device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I wonder how PA works in this case. Device uses
> > > > > > > > > > > > > > untranslated requests so it can only see IOVA.
> > > > > > > > > > > > > > We can't mandate
> > > > > > ATS anyhow.
> > > > > > > > > > > > > Michael suggested to keep the language uniform as
> > > > > > > > > > > > > PA as this is ultimately
> > > > > > > > > > > > what the guest driver is supplying during vq
> > > > > > > > > > > > creation and in posting buffers as physical address.
> > > > > > > > > > > >
> > > > > > > > > > > > This seems to need some work. And, can you show me
> > > > > > > > > > > > how it can
> > > > > > work?
> > > > > > > > > > > >
> > > > > > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor expected
> > > > > > > > > > > > to do a bisection of the whole range?
> > > > > > > > > > > > 2) does the device need to reserve sufficient
> > > > > > > > > > > > internal resources for logging the dirty page and why (not)?
> > > > > > > > > > > No when dirty page logging starts, only at that time,
> > > > > > > > > > > device will reserve
> > > > > > > > > > enough resources.
> > > > > > > > > >
> > > > > > > > > > GAW is 48bit, how large would it have then?
> > > > > > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > > > > > It is function of address ranges for the amount of guest
> > > > > > > > > memory regardless of
> > > > > > > > GAW.
> > > > > > > >
> > > > > > > > The problem is, e.g when vIOMMU is enabled, you can't know
> > > > > > > > which IOVA is actually used by guests. And even for the case
> > > > > > > > when vIOMMU is not enabled, the guest may have several TBs.
> > > > > > > > Is it easy to reserve sufficient resources by the device itself?
> > > > > > > >
> > > > > > > When page tracking is enabled per device, it knows about the
> > > > > > > range and it can
> > > > > > reserve certain resource.
> > > > > >
> > > > > > I didn't see such an interface in this series. Anything I miss?
> > > > > >
> > > > > Yes, this patch and the next patch is covering the page tracking
> > > > > start,stop and
> > > > query commands.
> > > > > They are named as write recording commands.
> > > >
> > > > So I still don't see how the device can reserve sufficient resources?
> > > > Guests may map a very large area of memory to IOMMU (or when vIOMMU
> > > > is disabled, GPA is used). It would be several TBs, how can the
> > > > device reserve sufficient resources in this case?
> > > When the map is established, the ranges are supplied to the device to know
> > how much to reserve.
> > > If device does not have enough resource, it fails the command.
> > >
> > > One can advance it further to provision for the desired range..
> >
> > Well, I think I've asked whether or not a bisection is needed, and you told me
> > not ...
> >
> > But at least we need to document this in the proposal, no?
> >
> We should expose a limit of the device in the proposed WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> So that future provisioning framework can use it.
>
> I will cover this in v5 early next week.
>
> > > >
> > > > >
> > > > > > Btw, the IOVA is allocated by the guest actually, how can we
> > > > > > know the
> > > > range?
> > > > > > (or using the host range?)
> > > > > >
> > > > > Hypervisor would have mapping translation.
> > > >
> > > > That's really tricky and can only work in some cases:
> > > >
> > > > 1) It requires the hypervisor to traverse the guest I/O page tables
> > > > which could be very large range
> > > > 2) It requests the hypervisor to trap the modification of guest I/O
> > > > page tables and synchronize with the range changes, which is
> > > > inefficient and can only be done when we are doing shadow PTEs. It
> > > > won't work when the nesting translation could be offloaded to the
> > > > hardware
> > > > 3) It is racy with the guest modification of I/O page tables which
> > > > is explained in another thread
> > > Mapping changes with more hw mmu's is not a frequent event and IOTLB
> > flush is done using querying the dirty log for the smaller range.
> > >
> > > > 4) No aware of new features like PASID which has been explained in
> > > > another thread
> > > For all the pinned work with non sw based IOMMU, it is typically small subset.
> > > PASID is guest controlled.
> >
> > Let's repeat my points:
> >
> > 1) vq1 use untranslated request with PASID1
> > 2) vq2 use untranslated request with PASID2
> >
> > Shouldn't we log PASID as well?
> >
> Possibly yes, either to request the tracking per PASID or to log the PASID.
> When in future PASID based VQ are supported, this part should be extended.

Who is going to do the extension? They are orthogonal features for sure.
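
One possible shape for such an extension, purely as an illustration
(none of these names or fields exist in the proposal), would be to tag
each write record with the address space it was logged under:

/*
 * Purely illustrative layout: a write record tagged with the address
 * space it was logged under, so records for different PASIDs, or for
 * translated vs. untranslated requests, can be told apart.
 */
#include <stdint.h>

#define WRITE_RECORD_F_PASID_VALID	(1u << 0)	/* hypothetical */
#define WRITE_RECORD_F_TRANSLATED	(1u << 1)	/* hypothetical */

struct write_record_ext {	/* hypothetical layout */
	uint64_t addr;		/* page address written by the device */
	uint32_t flags;		/* WRITE_RECORD_F_* */
	uint32_t pasid;		/* valid only with PASID_VALID */
};

Whether a record carries the PASID, or tracking is instead requested
per PASID up front, is exactly the open point here.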

>
> > And
> >
> > 1) vq1 is using translated request
> > 2) vq2 is using untranslated request
> >

How about this?

>
> > How could we differ?
> >
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > Host should always have more resources than device, in that
> > > > > > > > sense there could be several methods that tries to utilize
> > > > > > > > host memory instead of the one in the device. I think we've
> > > > > > > > discussed this when going through the doc prepared by Eugenio.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > What happens if we're trying to migrate more than 1 device?
> > > > > > > > > >
> > > > > > > > > That is perfectly fine.
> > > > > > > > > Each device is updating its log of pages it wrote.
> > > > > > > > > The hypervisor is collecting their sum.
> > > > > > > >
> > > > > > > > See above.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 3) DMA is part of the transport, it's natural to do
> > > > > > > > > > > > logging there, why duplicate efforts in the virtio layer?
> > > > > > > > > > > He he, you have funny comment.
> > > > > > > > > > > When an abstract facility is added to virtio you say
> > > > > > > > > > > to do in
> > > > transport.
> > > > > > > > > >
> > > > > > > > > > So it's not done in the general facility but tied to the admin part.
> > > > > > > > > > And we all know dirty page tracking is a challenge and
> > > > > > > > > > Eugenio has a good summary of pros/cons. A revisit of
> > > > > > > > > > those docs make me think virtio is not the good place
> > > > > > > > > > for doing that for
> > > > may reasons:
> > > > > > > > > >
> > > > > > > > > > 1) as stated, platform will evolve to be able to
> > > > > > > > > > tracking dirty pages, actually, it has been supported by
> > > > > > > > > > a lot of major IOMMU vendors
> > > > > > > > >
> > > > > > > > > This is optional facility in virtio.
> > > > > > > > > Can you please point to the references? I don’t see it in
> > > > > > > > > the common Linux
> > > > > > > > kernel support for it.
> > > > > > > >
> > > > > > > > Note that when IOMMUFD is being proposed, dirty page
> > > > > > > > tracking is one of the major considerations.
> > > > > > > >
> > > > > > > > This is one recent proposal:
> > > > > > > >
> > > > > > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > > > > > >
> > > > > > > Sure, so if platform supports it. it can be used from the platform.
> > > > > > > If it does not, the device supplies it.
> > > > > > >
> > > > > > > > > Instead Linux kernel choose to extend to the devices.
> > > > > > > >
> > > > > > > > Well, as I stated, tracking dirty pages is challenging if
> > > > > > > > you want to do it on a device, and you can't simply invent
> > > > > > > > dirty page tracking for each type of the devices.
> > > > > > > >
> > > > > > > It is not invented.
> > > > > > > It is generic framework for all virtio device types as proposed here.
> > > > > > > Keep in mind, that it is optional already in v3 series.
> > > > > > >
> > > > > > > > > At least not seen to arrive this in any near term in start
> > > > > > > > > of
> > > > > > > > > 2024 which is
> > > > > > > > where users must use this.
> > > > > > > > >
> > > > > > > > > > 2) you can't assume virtio is the only device that can
> > > > > > > > > > be used by the guest, having dirty pages tracking to be
> > > > > > > > > > implemented in each type of device is unrealistic
> > > > > > > > > Of course, there is no such assumption made. Where did you
> > > > > > > > > see a text that
> > > > > > > > made such assumption?
> > > > > > > >
> > > > > > > > So what happens if you have a guest with virtio and other
> > > > > > > > devices
> > > > assigned?
> > > > > > > >
> > > > > > > What happens? Each device type would do its own dirty page tracking.
> > > > > > > And if all devices does not have support, hypervisor knows to
> > > > > > > fall back to
> > > > > > platform iommu or its own.
> > > > > > >
> > > > > > > > > Each virtio and non virtio devices who wants to report
> > > > > > > > > their dirty page report,
> > > > > > > > will do their way.
> > > > > > > > >
> > > > > > > > > > 3) inventing it in the virtio layer will be deprecated
> > > > > > > > > > in the future for sure, as platform will provide much
> > > > > > > > > > rich features for logging e.g it can do it per PASID
> > > > > > > > > > etc, I don't see any reason virtio need to compete with
> > > > > > > > > > the features that will be provided by the platform
> > > > > > > > > Can you bring the cpu vendors and committement to virtio
> > > > > > > > > tc with timelines
> > > > > > > > so that virtio TC can omit?
> > > > > > > >
> > > > > > > > Why do we need to bring CPU vendors in the virtio TC? Virtio
> > > > > > > > needs to be built on top of transport or platform. There's
> > > > > > > > no need to duplicate
> > > > > > their job.
> > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > >
> > > > > > > I wanted to see a strong commitment for the cpu vendors to
> > > > > > > support dirty
> > > > > > page tracking.
> > > > > >
> > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD
> > > > > > and ARM are all supporting that now.
> > > > > >
> > > > > > > And the work seems to have started for some platforms.
> > > > > >
> > > > > > Let me quote from the above link:
> > > > > >
> > > > > > """
> > > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > > alongside VT-D rev3.x also do support.
> > > > > > """
> > > > > >
> > > > > > > Without such platform commitment, virtio also skipping it would not
> > work.
> > > > > >
> > > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > > vtd, the hw feature has been there for years.
> > > > > >
> > > > > Vtd has a sticky D bit that requires synchronization with IOPTE
> > > > > page caches
> > > > when sw wants to clear it.
> > > >
> > > > This is by design.
> > > >
> > > > > Do you know if is it reliable when device does multiple writes,
> > > > > ie,
> > > > >
> > > > > a. iommu write D bit
> > > > > b. software read it
> > > > > c. sw synchronize cache
> > > > > d. iommu write D bit on next write by device
> > > >
> > > > What issue did you see here? But that's not even an excuse, if
> > > > there's a bug, let's report it to IOMMU vendors and let them fix it.
> > > > The thread I point to you is actually a good space.
> > > >
> > > So we cannot claim that it is there in the platform.
> >
> > I'm confused, the thread I point to you did the cache synchronization which has
> > been explained in the changelog, so what's the issue?
> >
> If the ask is for IOMMU chip to fix something, we cannot claim that dirty page tracking is available already in platform.

Again, can you describe the issue? Why do you think the sticky part is
an issue? The IOTLB needs to be in sync with the IO page tables; what's
wrong with this?
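
For context, the synchronization in question is roughly the following
read-and-clear ordering; a conceptual sketch with made-up helpers and a
made-up bit position, not VT-d driver code:

/*
 * Conceptual sketch of harvesting IOMMU dirty (D) bits during live
 * migration. The point is only the ordering: clear D in the IO page
 * table entry, flush the IOTLB, and only then treat the page as clean,
 * so a device write racing with the harvest sets D again and is never
 * lost.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	4096ULL
#define IOPTE_DIRTY	(1ULL << 9)	/* placeholder bit position */
#define NR_PAGES	4

/* Toy IO page table: one PTE per page, D set by "the device". */
static _Atomic uint64_t iopte[NR_PAGES] = { IOPTE_DIRTY, 0, IOPTE_DIRTY, 0 };

static void migration_bitmap_set(uint64_t iova)
{
	printf("page at iova 0x%llx is dirty\n", (unsigned long long)iova);
}

/* Stub for the (expensive) invalidation of cached translations. */
static void iotlb_flush_range(uint64_t iova, uint64_t len)
{
	(void)iova;
	(void)len;
}

static void harvest_dirty(uint64_t iova_base)
{
	bool any = false;

	for (unsigned int i = 0; i < NR_PAGES; i++) {
		uint64_t old = atomic_fetch_and(&iopte[i], ~IOPTE_DIRTY);

		if (old & IOPTE_DIRTY) {
			migration_bitmap_set(iova_base + i * PAGE_SIZE);
			any = true;
		}
	}

	/* Cached entries may still carry stale D state. */
	if (any)
		iotlb_flush_range(iova_base, NR_PAGES * PAGE_SIZE);
}

int main(void)
{
	harvest_dirty(0x100000);
	return 0;
}

The cost is the IOTLB flush that has to follow every harvest before the
cleared bits can be trusted, which is what the "synchronization with
IOPTE page caches" above refers to.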

>
> > >
> > > > Again, the point is to let the correct role play.
> > > >
> > > How many more years should we block the virtio device migration when
> > platform do not have it?
> >
> > At least for VT-D, it has been used for years.
> Is this device written pages tracked by KVM for VT-d as dirty page log, instead through vfio?

I don't get this question.

>
> >
> > >
> > > > >
> > > > > ARM SMMU based servers to be present with D bit tracking.
> > > > > It is still early to say platform is ready.
> > > >
> > > > This is not what I read from both the series I posted and the spec,
> > > > dirty bit has been supported several years ago at least for vtd.
> > > Supported, but spec listed it as sticky bit that may require special handling.
> >
> > Please explain why this is "special handling". IOMMU has several different layers
> > of caching, by design, it can't just open a window for D bit.
> >
> > > May be it is working, but not all cpu platforms have it.
> >
> > I don't see the point. Migration is not supported for virito as well.
> >
> I don’t see a point either to discuss.
>
> I already acked that platform may have support as well, and not all platform has it.
> So the device feeds the data and its platform's choice to enable/disable.

I've pointed out sufficient issues and I don't want to repeat them.

>
> > >
> > > >
> > > > >
> > > > > It is optional so whichever has the support it will be used.
> > > >
> > > > I can't see the point of this, it is already available. And
> > > > migration doesn't exist in virtio spec yet.
> > > >
> > > > >
> > > > > > >
> > > > > > > > > i.e. in first year of 2024?
> > > > > > > >
> > > > > > > > Why does it matter in 2024?
> > > > > > > Because users needs to use it now.
> > > > > > >
> > > > > > > >
> > > > > > > > > If not, we are better off to offer this, and when/if
> > > > > > > > > platform support is, sure,
> > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > >
> > > > > > > > > > 4) if the platform support is missing, we can use
> > > > > > > > > > software or leverage transport for assistance like PRI
> > > > > > > > > All of these are in theory.
> > > > > > > > > Our experiment shows PRI performance is 21x slower than
> > > > > > > > > page fault rate
> > > > > > > > done by the cpu.
> > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > >
> > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > Do you have perf data for this?
> > > > > >
> > > > > > No, but it's not hard to imagine the worst case. Wrote a small
> > > > > > program that dirty every page by a NIC.
> > > > > >
> > > > > > > In the internal tests we don’t see this happening.
> > > > > >
> > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > >
> > > > > > So if we get very high dirty rates (e.g by a high speed NIC), we
> > > > > > can't satisfy the requirement of the downtime. Or if you see the
> > > > > > converge, you might get help from the auto converge support by
> > > > > > the hypervisors like KVM where it tries to throttle the VCPU
> > > > > > then you can't reach
> > > > the wire speed.
> > > > > >
> > > > > Once PRI is enabled, even without migration, there is basic perf issues.
> > > >
> > > > The context is not PRI here...
> > > >
> > > > It's about if you can stick to wire speed during live migration.
> > > > Based on the analysis so far, you can't achieve wirespeed and downtime at
> > the same time.
> > > > That's why the hypervisor needs to throttle VCPU or devices.
> > > >
> > > So?
> > > Device also may throttle itself.
> >
> > That's perfectly fine. We are on the same page, no? It's wrong to judge the dirty
> > page tracking in the context of live migration by measuring whether or not the
> > device can work at wire speed.
> >
> > >
> > > > For PRI, it really depends on how you want to use it. E.g if you
> > > > don't want to pin a page, the performance is the price you must pay.
> > > PRI without pinning does not make sense for device to make large mapping
> > queries.
> >
> > That's also fine. Hypervisors can choose to enable and use PRI depending on
> > the different cases.
> >
> So PRI is not must for device migration.

I never said it's a must.

> Device migration must be able to work without PRI enabled, as simple as that as first base line.

My point is that you need to document

1) why you think dirty page tracking is a must or not
2) why you chose one specific way instead of the others

>
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > So it is unusable.
> > > > > > > >
> > > > > > > > It's not about mandating, it's about doing things in the
> > > > > > > > correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > You should try.
> > > > > >
> > > > > > Not my duty, I just want to make sure things are done in the
> > > > > > correct layer, and once it needs to be done in the virtio,
> > > > > > there's nothing obviously
> > > > wrong.
> > > > > >
> > > > > At present, it looks all platforms are not equally ready for page tracking.
> > > >
> > > > That's not an excuse to let virtio support that.
> > > It is wrong attribution as excuse.
> > >
> > > > And we need also to figure out if
> > > > virtio can do that easily. I've pointed out sufficient issues, I'm
> > > > pretty sure there would be more as the platform evolves.
> > > >
> > > I am not sure if virtio feeds the log into the platform.
> >
> > I don't understand the meaning here.
> >
> I mistakenly merged two sentences.
>
> Virtio feeds the dirty page details to the hypervisor platform which collects and merges the page record.
> So it is platform choice to use iommu based tracking or device based.
>
> > >
> > > > >
> > > > > > > In the current state, it is mandating.
> > > > > > > And if you think PRI is the only way,
> > > > > >
> > > > > > I don't, it's just an example where virtio can leverage from
> > > > > > either transport or platform. Or if it's the fault in virtio
> > > > > > that slows down the PRI, then it is something we can do.
> > > > > >
> > > > > Yea, it does not seem to be ready yet.
> > > > >
> > > > > > >  than you should propose that in the dirty page tracking
> > > > > > > series that you listed
> > > > > > above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > >
> > > > > > No, the point is to not duplicate works especially considering
> > > > > > virtio can't do better than platform or transport.
> > > > > >
> > > > > Both the platform and virtio work is ongoing.
> > > >
> > > > Why duplicate the work then?
> > > >
> > > Not all cpu platforms support as far as I know.
> >
> > Yes, but we all know the platform is working to support this.
> >
> > Supporting this on the device is hard.
> >
> This is optional, whichever device would like to implement it, will support it.
>
> > >
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > When one does something in transport, you say, this is
> > > > > > > > > > > transport specific, do
> > > > > > > > > > some generic.
> > > > > > > > > > >
> > > > > > > > > > > Here the device is being tracked is virtio device.
> > > > > > > > > > > PCI-SIG has told already that PCIM interface is
> > > > > > > > > > > outside the scope of
> > > > it.
> > > > > > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > > > > > >
> > > > > > > > > > You will end up with a competition with the
> > > > > > > > > > platform/transport one that will fail.
> > > > > > > > > >
> > > > > > > > > I don’t see a reason. There is no competition.
> > > > > > > > > Platform always have a choice to not use device side page
> > > > > > > > > tracking when it is
> > > > > > > > supported.
> > > > > > > >
> > > > > > > > Platform provides a lot of other functionalities for dirty logging:
> > > > > > > > e.g per PASID, granular, etc. So you want to duplicate them
> > > > > > > > again in the virtio? If not, why choose this way?
> > > > > > > >
> > > > > > > It is optional for the platforms where platform do not have it.
> > > > > >
> > > > > > We are developing new virtio functionalities that are targeted
> > > > > > for future platforms. Otherwise we would end up with a feature
> > > > > > with a very narrow use case.
> > > > > In general I agree that platform is an option too.
> > > > > Hypervisor will be able to make the decision to use platform when
> > > > > available
> > > > and fallback to device method when platform does not have it.
> > > > >
> > > > > Future and to be equally usable in near term :)
> > > >
> > > > Please don't double standard again:
> > > >
> > > > When you are talking about TDISP, you want virtio to be designed to
> > > > fit for the future where the platform is ready in the future When
> > > > you are talking about dirty tracking, you want it to work now even
> > > > if
> > > >
> > > The proposal of transport VQ is anti-TDISP.
> >
> > It's nothing about transport VQ, it's about you're saying the adminq based
> > device context. There's a comment to point out that the current TDISP spec
> > forbids modifying device state when TVM is attached. Then you told us the
> > TDISP may evolve for that.
> So? That is not double standard.
> The proposal is based on main principle that it is not depending on hypervisor traping + emulating which is the baseline of TDISP
>
> >
> > > The proposal of dirty tracking is not anti-platform. It is optional like rest of the
> > platform.
> > >
> > > > 1) most of the platform is ready now
> > > Can you list a ARM server CPU in production that has it? (not in some pdf
> > spec).
> >
> > Then in the context of a dirty page, I've proved you dirty page tracking has been
> > supported by all major vendors.
> Major IP vendor != major cpu chip vendor.
> I don’t agree with the proof.

So this will be an endless debate. Did I ever ask you about an ETA or any
product for TDISP?

>
> I already acknowledged that I have seen internal test report for dirty tracking with one cpu and nic.
>
> I just don’t see all cpus have support for it.
> Hence, this optional feature.

To repeat myself again:

If it can be done easily and efficiently in virtio, I agree. But I've
pointed out several issues that have not been answered.

>
> > Where you refuse to use the standard you used
> > in explaining adminq for device context in TDISP.
> >
> > So I didn't ask you the ETA of the TDISP support for migration or adminq, but
> > you want me to give you the production information which is pointless.
> Because you keep claiming that _all_ cpus in the world has support for efficient dirty page tracking.
>
> > You
> > might need to ask ARM to get an answer, but a simple google told me the effort
> > to support dirty page tracking in SMMUv3 could go back to early 2021.
> >
> To my knowledge ARM do not produce physical chips.
> Your proposal is to keep those ARM server vendors to not use virtio devices.

This arbitrary conclusion makes no sense.

I know at least one cloud vendor has used a virtio based device for
years on ARM. And that vendor has posted patches to support dirty page
tracking since 2020.

Thanks

> Does not make sense to me.
>
> > https://lore.kernel.org/linux-iommu/56b001fa-b4fe-c595-dc5e-
> > f362d2f07a19@linux.intel.com/t/
> >
> > Why is it not merged? It's simply because we agree to do it in the layer of
> > IOMMUFD so it needs to wait.
> >
> > Thanks
> >
> >
> > >
> > > > 2) whether or not virtio can log dirty page correctly is still
> > > > suspicious
> > > >
> > > > Thanks
> > >
> > > There is no double standard. The feature is optional which co-exists as
> > explained above.
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21  4:21                                         ` Jason Wang
@ 2023-11-21 16:24                                           ` Parav Pandit
  2023-11-22  4:11                                             ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-21 16:24 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: virtio-comment, cohuck, sburla, Shahaf Shuler, Maor Gottlieb,
	Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 9:52 AM
> 
> On Thu, Nov 16, 2023 at 2:49 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> >
> > On Thu, Nov 16, 2023 at 12:24:27PM +0800, Jason Wang wrote:
> > > On Thu, Nov 16, 2023 at 1:37 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, November 13, 2023 9:11 AM
> > > > >
> > > > > On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > > Hi Michael,
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Thursday, November 9, 2023 1:29 PM
> > > > > >
> > > > > > [..]
> > > > > > > > Besides the issue of performance, it's also racy, assuming
> > > > > > > > we are logging
> > > > > > > IOVA.
> > > > > > > >
> > > > > > > > 0) device log IOVA
> > > > > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > > > > 2) guest map IOVA to a new GPA
> > > > > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > > > > >
> > > > > > > > Then we lost the old GPA.
> > > > > > >
> > > > > > > Interesting and a good point. And by the way e.g. vhost has
> > > > > > > the same issue.  You need to flush dirty tracking info when
> > > > > > > changing the mappings somehow.  Parav what's the plan for
> > > > > > > this? Should be addressed in
> > > > > the spec too.
> > > > > > >
> > > > > > As you listed the flush is needed for vhost or device-based DPT.
> > > > >
> > > > > What does DPT mean? Device Page Table? Let's not invent
> > > > > terminology which is not known by others please.
> > > > >
> > > > Sorry for using the acronym. I meant dirty page tracking.
> > > >
> > > > > We have discussed it many times. You can't just depend on ATS or
> > > > > reinventing wheels in virtio.
> > > > The dependency is on the iommu which would have the mapping of
> GIOVA to GPA like any sw implementation.
> > > > No dependency on ATS.
> > > >
> > > > >
> > > > > What's more, please try not to give me the impression that the
> > > > > proposal is optimized for a specific vendor (like device IOMMU stuffs).
> > > > >
> > > > You should stop calling this specific vendor thing.
> > >
> > > Well, as you have explained, the confusion came from "DPT" ...
> > >
> > > > One can equally say that suspend bit proposal is for the sw_vendor
> device who is forcing virtio hw device to only implement ioqueues + PASID +
> non_unified interface for PF, VF, SIOVs + non_TDISP based devices.
> > > >
> > > > > > The necessary plumbing is already covered for this in the
> > > > > > query (read and
> > > > > clear) command of this v3 proposal.
> > > > >
> > > > > The issue is logging via IOVA ... I don't see how "read and clear" can
> help.
> > > > >
> > > > Read and clear helps that ensures that all the dirty pages are reported,
> hence there is no mapping/unmapping race.
> > >
> > > Reported as IOVA ...
> > >
> > > > As everything is reported.
> > > >
> > > > > > It is listed in Device Write Records Read Command.
> > > > >
> > > > > Please explain how your proposal can solve the above race.
> > > > >
> > > > In below manner.
> > > > 1. guest has GIOVA to GPA_1 mapping 2. RX packets occurred to
> > > > GIOVA 3. device reported dirty page log for GIOVA (hypervisor is
> > > > yet to read) 4. guest requested mapping change from GIOVA to GPA_2
> > > > 4.1 During this IOTLB is invalidated and dirty page report is
> > > > queried ensuring, it can change the mapping
> > >
> > > It requires
> > >
> > > 1) hypervisor traps IOTLB invalidation, which doesn't work when
> > > nesting could be offloaded (IOMMUFD has started the work to support
> > > nesting)
> > > 2) query the device about the dirty page on each IOTLB invalidation which:
> > > 2.1) A huge round trip: guest IOTLB invalidation -> trapped by
> > > hypervisor -> start the query from the device -> device return ->
> > > hypervisor reports IOTLB invalidation is done -> let guest run. Have
> > > you benchmarked the RTT in this case? There are just too many places
> > > that cause the delay in the middle.
> >
> > To be fair invalidations are already expensive e.g. with vhost iotlb
> > it requires a slow system call.
> > This will make them *even more* expensive.
> 
> Yes, a slow syscall plus a virtqueue query RTT.
> 
Only in the vIOMMU case.
Without a vIOMMU this is not applicable; a rough sketch of that flow follows below.
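
To make this concrete, here is a minimal hypervisor-side sketch of the flow
being debated (pseudocode only; none of these helper names exist in this
series or in any shipping driver, they merely illustrate the round trip):

/*
 * Hypothetical flow: when the guest's vIOMMU invalidation is trapped,
 * drain the device's write records for that range before the IOVA->GPA
 * mapping is allowed to change, so the old GPAs are not lost.
 */
static int handle_viommu_invalidation(struct vm_ctx *vm, u64 iova, u64 len)
{
	struct write_record recs[64];
	int n;

	do {
		/* admin command: read and clear write records for the range */
		n = admin_read_and_clear_write_records(vm->owner_pf,
						       vm->member_vf_id,
						       iova, len,
						       recs, 64);
		if (n < 0)
			return n;

		/*
		 * Mark the *old* GPAs dirty before the mapping changes.
		 * If the device already logs guest physical addresses,
		 * this translation step is a no-op.
		 */
		for (int i = 0; i < n; i++)
			mark_gpa_dirty(vm, iova_to_gpa(vm, recs[i].addr));
	} while (n == 64);

	/* only now let the guest's invalidation complete */
	return viommu_invalidation_done(vm, iova, len);
}

The extra query is the RTT discussed above, and it is paid only on this
(already slow) invalidation path.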

> Need some benchmark. It looks to me currently the invalidation is done via a
> queued based interface in vtd. So guests may need to spin where it may trigger
> a lockup in the guest.
> 

> >
> > Problem for some but not all workloads.  Again I agree motivation,
> > tradeoffs and comparison with both dirty tracking by iommu and shadow
> > vq approaches really should be included.
> 
Dirty tracking by the IOMMU is to be considered.
Shadow vq is not in my scope and it does not fit the basic requirements, as explained before.
So it is a different discussion.

> +1
> 
> >
> >
> > > 2.2) Guest triggerable behaviour, malicious guest can simply do
> > > endless IOTLB invalidation to DOS the e.g admin virtqueue
> >
> > I'm not sure how much to worry about it - just don't allow more than
> > one in flight per VM.
> 
> That's fine but it may need a note.
> 
> Thanks
> 
> 
> >
> >
> >
> > > >
> > > > > >
> > > > > > When the page write record is fully read, it is flushed.
> > > > > > How/when to use, I think its hypervisor specific, so we
> > > > > > probably better off not
> > > > > documenting those details.
> > > > >
> > > > > Well, as the author of this proposal, at least you need to know
> > > > > how a hypervisor can work with your proposal, no?
> > > > >
> > > > Likely yes, but it is not the scope of the spec to list those paths etc.
> > >
> > > Fine, but as a reviewer I need to know if it can work with a hypervisor well.
> > >
> > > >
> > > > > > May be such read is needed in some other path too depending on
> > > > > > how
> > > > > hypervisor implemented.
> > > > >
> > > > > What do you mean by "May be ... some other path" here? You're
> > > > > inventing a mechanism that you don't know how a hypervisor can use?
> > > >
> > > > No. I meant hypervisor may have more operations that
> map/unmap/flush where it may need to implement it.
> > > > Some one may call it set_map(), some may say dma_map()...
> > >
> > > Ok.
> > >
> > > Thanks
> >


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21  4:24                                                 ` Jason Wang
@ 2023-11-21 16:26                                                   ` Parav Pandit
  2023-11-22  4:14                                                     ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-21 16:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 9:55 AM
> 
> On Fri, Nov 17, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 16, 2023 11:51 PM
> > >
> > > On Thu, Nov 16, 2023 at 05:29:49PM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 10:56 PM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Thursday, November 16, 2023 5:18 PM
> > > > > > >
> > > > > > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > > > > > >
> > > > > > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin
> wrote:
> > > > > > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit
> wrote:
> > > > > > > > > > > We should expose a limit of the device in the
> > > > > > > > > > > proposed
> > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it
> > > > > > > > > can
> > > > > track.
> > > > > > > > > > > So that future provisioning framework can use it.
> > > > > > > > > > >
> > > > > > > > > > > I will cover this in v5 early next week.
> > > > > > > > > >
> > > > > > > > > > I do worry about how this can even work though. If you
> > > > > > > > > > want a generic device you do not get to dictate how
> > > > > > > > > > much memory
> > > VM has.
> > > > > > > > > >
> > > > > > > > > > Aren't we talking bit per page? With 1TByte of memory
> > > > > > > > > > to track
> > > > > > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > >
> > > > > > > > > Ugh. Actually of course:
> > > > > > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit ->
> > > > > > > > > 8Mbyte per VF
> > > > > > > > >
> > > > > > > > > 8Gbyte per *PF* with 1K VFs.
> > > > > > > > >
> > > > > > > > Device may not maintain as a bitmap.
> > > > > > >
> > > > > > > However you maintain it, there's 256Mega bit of information.
> > > > > > There may be other data structures that device may deploy as
> > > > > > for example
> > > > > hash or tree or something else.
> > > > >
> > > > > Point being?
> > > > The device may have some hashing accelerator or other improvements
> > > > that
> > > may perform better than bitmap as many queues in parallel attempt to
> > > update the shared database.
> > >
> > > Maybe, I didn't give this thought.
> > >
> > > My point was that to be able to keep all combinations of dirty/non
> > > dirty page for each 4k page in a 1TByte guest device needs 8MBytes
> > > of on-device memory per VF. As designed the query also has to report
> > > it for each VF accurately even if multiple VFs are accessing same guest.
> > Yes.
> >
> > >
> > > > >
> > > > > > And this is runtime memory only during the short live
> > > > > > migration period of
> > > > > 400msec or less.
> > > > > > It is not some _always_ resident memory.
> 
> When developing the spec, we should not have any assumption for the
> implementation. For example, you can't just assume virtio is always emulated
> in the software in the DPU.
> 
There is no such assumption.
It is supported on non-DPU devices too.

> How can you make sure you can converge in 400ms without having a interface
> for the driver to set the correct parameter like dirty rates?

400msec is not written anywhere as a requirement, if that is what you want to argue about.
Nothing prevents extending the interface in the future with additional commands that define such an SLA and improve the solution; a purely illustrative sketch follows below.
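
Just to show what such a future extension could look like (hypothetical field
names only, nothing below is defined by this series):

/* Hypothetical future SLA command payload, not part of this proposal */
struct virtio_admin_cmd_dev_mig_sla_data {
	le32 max_downtime_ms;     /* downtime target requested by the driver */
	le32 max_dirty_rate_mbps; /* rate above which the device may throttle */
};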

There is no need to boil the ocean now. Once the base infrastructure is built, we will improve it further.
And the proposed patches are reasonably well covered, to our knowledge.

> 
> Thanks
> 
> > > > >
> > > > > No - write tracking is used in the live phase of migration. It
> > > > > can be enabled as long as you wish - it's a question of policy.
> > > > > There actually exist solutions that utilize this phase for
> > > > > redundancy, permanently
> > > running in this mode.
> > > >
> > > > If such use case exists, one may further improve the device
> implementation.
> > >
> > > Yes such use cases exist, there is no limit on how long migration takes.
> > > So go ahead and further improve it please. Do not give us "we did
> > > not get requests for this feature" please.
> >
> > Please describe the use case more precisely.
> > If there is any application or OS API etc exists, please point to it where would
> you like to fit this dirty page tracking beyond device migration.
> > We may have to draw a line to have reasonable point and not keep
> discussing infinitely.
> >


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21  5:16                                                                         ` Jason Wang
@ 2023-11-21 16:29                                                                           ` Parav Pandit
  2023-11-21 21:00                                                                             ` Michael S. Tsirkin
  2023-11-22  4:17                                                                             ` Jason Wang
  0 siblings, 2 replies; 157+ messages in thread
From: Parav Pandit @ 2023-11-21 16:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 10:47 AM
> 
> On Fri, Nov 17, 2023 at 8:51 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis- open.org> On Behalf Of Michael S.
> > > Tsirkin
> > > Sent: Friday, November 17, 2023 6:11 PM
> > >
> > > On Fri, Nov 17, 2023 at 12:22:59PM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 5:03 PM
> > > > > To: Parav Pandit <parav@nvidia.com>
> > > >
> > > > > > Somehow the claim of shadow vq is great without sharing any
> > > > > > performance
> > > > > numbers is what I don't agree with.
> > > > >
> > > > > It's upstream in QEMU. Test it youself.
> > > > >
> > > > We did few minutes back.
> > > > It results in a call trace.
> > > > Vhost_vdpa_setup_vq_irq crashes on list corruption on net-next.
> > >
> > > Wrong list for this bug report.
> > >
> > > > We are stopping any shadow vq tests on unstable stuff.
> > >
> > > If you don't want to benchmark against alternatives how are you
> > > going to prove your stuff is worth everyone's time?
> >
> > Comparing performance of the functional things count.
> > You suggest shadow vq, frankly you should post the grand numbers of
> shadow vq.
> 
> We need an apple to apple comparison. Otherwise you may argue with that,
> no?
> 
A comparison of solutions can be made once their requirements match.
And I don’t see the basic requirements matching between these two different use cases.
So there is no point in discussing one OS-specific implementation as the reference point.
Otherwise I will end up adding a vfio link in the commit log in the next version, since you are asking for similar things here while not being neutral in your ask.

Anyway, please bring whatever perf data you want to compare in another forum. It is not the acceptance criteria anyway.

> >
> > It is really not my role to report bug of unstable stuff and compare the perf
> against.
> 
> Qemu/KVM is highly relevant here no? And it's the way to develop the
> community. The shadow vq code is handy.
It is relevant for a directly mapped device.
There is absolutely no point in converting a virtio device into another virtualization layer, running it again, and getting another virtio device out of it.
So for the direct mapping use case, shadow vq is not relevant.
For other use cases, please continue.
> 
> Just an email to Qemu should be fine, we're not asking you to fix the bug.
> 
> Btw, how do you define stable? E.g do you think the Linus tree is stable?
> 
A basic test with iperf is not working; it crashes.
All of this is a completely unrelated discussion to this series that slows down the work.
I don’t see any value in it.
Michael asked us to do the test, we did, and it does not work. Functionally broken code cannot serve as a comparison.

> Thanks
> 
> >
> > We propose device context and provided the numbers you asked. Mostly
> wont be able to go farther than this.
> >
> >


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21  6:55                                                             ` Jason Wang
@ 2023-11-21 16:30                                                               ` Parav Pandit
  2023-11-22  4:19                                                                 ` Jason Wang
  2023-11-22  2:31                                                               ` Si-Wei Liu
  1 sibling, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-21 16:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan, virtio-comment, cohuck,
	sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas, eperezma,
	Siwei Liu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 12:25 PM
> 
> On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 7:31 PM
> > > To: Parav Pandit <parav@nvidia.com>
> > >
> > > On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 6:02 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > > > >
> > > > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit
> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
> > > > > > > > > > > > > > Lingshan
> > > > > wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000,
> > > > > > > > > > > > > >>> Parav Pandit
> > > > > wrote:
> > > > > > > > > > > > > >>>> We should expose a limit of the device in
> > > > > > > > > > > > > >>>> the proposed
> > > > > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much
> > > range
> > > > > > > > > > > > > it can
> > > > > > > > > track.
> > > > > > > > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > > > > > >>> I do worry about how this can even work though.
> > > > > > > > > > > > > >>> If you want a generic device you do not get
> > > > > > > > > > > > > >>> to dictate how much memory VM
> > > > > > > > > has.
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> Aren't we talking bit per page? With 1TByte
> > > > > > > > > > > > > >>> of memory to track
> > > > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > > > > > > >>> while at the same time fighting tooth and
> > > > > > > > > > > > > >>> nail against adding single bit status
> > > > > > > > > > > > > >>> registers because
> > > scalability?
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> I have a feeling doing this completely
> > > > > > > > > > > > > >>> theoretical like this is
> > > > > > > > > problematic.
> > > > > > > > > > > > > >>> Maybe you have it all laid out neatly in
> > > > > > > > > > > > > >>> your head but I suspect not all of TC can
> > > > > > > > > > > > > >>> picture it clearly enough based just on spec
> > > > > > > > > text.
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> We do sometimes ask for POC implementation
> > > > > > > > > > > > > >>> in linux / qemu to demonstrate how things
> > > > > > > > > > > > > >>> work before merging
> > > > > code.
> > > > > > > > > > > > > >>> We skipped this for admin things so far but
> > > > > > > > > > > > > >>> I think it's a good idea to start doing it here.
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> What makes me pause a bit before saying
> > > > > > > > > > > > > >>> please do a PoC is all the opposition that
> > > > > > > > > > > > > >>> seems to exist to even using admin commands
> > > > > > > > > > > > > >>> in the 1st place. I think once we finally
> > > > > > > > > > > > > >>> stop arguing about whether to use admin
> > > > > > > > > > > > > >>> commands at all then a PoC will be needed
> > > > > > > > > > > before merging.
> > > > > > > > > > > > > >> We have POR productions that implemented the
> > > > > > > > > > > > > >> approach in my
> > > > > > > > > series.
> > > > > > > > > > > > > >> They are multiple generations of productions
> > > > > > > > > > > > > >> in market and running in customers data centers for
> years.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Back to 2019 when we start working on vDPA,
> > > > > > > > > > > > > >> we have sent some samples of production(e.g.,
> > > > > > > > > > > > > >> Cascade
> > > > > > > > > > > > > >> Glacier) and the datasheet, you can find live
> > > > > > > > > > > > > >> migration facilities there, includes suspend,
> > > > > > > > > > > > > >> vq state and other
> > > > > features.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> And there is an reference in DPDK live
> > > > > > > > > > > > > >> migration, I have provided this page
> > > > > > > > > > > > > >> before:
> > > > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/if
> > > > > > > > > > > > > >> c.ht ml, it has been working for long long
> > > > > > > > > > > > > >> time.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> So if we let the facts speak, if we want to
> > > > > > > > > > > > > >> see if the proposal is proven to work, I
> > > > > > > > > > > > > >> would
> > > > > > > > > > > > > >> say: They are POR for years, customers
> > > > > > > > > > > > > >> already deployed them for
> > > > > > > > > years.
> > > > > > > > > > > > > > And I guess what you are trying to say is that
> > > > > > > > > > > > > > this patchset we are reviewing here should be
> > > > > > > > > > > > > > help to the same standard and there should be
> > > > > > > > > > > > > > a PoC? Sounds
> > > reasonable.
> > > > > > > > > > > > > Yes and the in-marketing productions are POR,
> > > > > > > > > > > > > the series just improves the design, for
> > > > > > > > > > > > > example, our series also use registers to track
> > > > > > > > > > > > > vq state, but improvements than CG or BSC. So I
> > > > > > > > > > > > > think they are proven
> > > > > > > > > > > to work.
> > > > > > > > > > > >
> > > > > > > > > > > > If you prefer to go the route of POR and
> > > > > > > > > > > > production and proven documents
> > > > > > > > > > > etc, there is ton of it of multiple types of
> > > > > > > > > > > products I can dump here with open- source code and
> > > > > > > > > > > documentation and
> > > more.
> > > > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > > > >
> > > > > > > > > > > > Michael has requested some performance
> > > > > > > > > > > > comparisons, not all are ready to
> > > > > > > > > > > share yet.
> > > > > > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > > > > > >
> > > > > > > > > > > > And all the vdpa dpdk you published does not have
> > > > > > > > > > > > basic CVQ support when I
> > > > > > > > > > > last looked at it.
> > > > > > > > > > > > Do you know when was it added?
> > > > > > > > > > >
> > > > > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > > > > The problem with CVQ generally, is that VDPA wants
> > > > > > > > > > > to shadow CVQ it at all times because it wants to
> > > > > > > > > > > decode and cache the content. But this problem has
> > > > > > > > > > > nothing to do with dirty tracking even though it
> > > > > > > > > > > also
> > > > > > > > > mentions "shadow":
> > > > > > > > > > > if device can report it's state then there's no need
> > > > > > > > > > > to shadow
> > > CVQ.
> > > > > > > > > >
> > > > > > > > > > For the performance numbers with the pre-copy and
> > > > > > > > > > device context of
> > > > > > > > > patches posted 1 to 5, the downtime reduction of the VM
> > > > > > > > > is 3.71x with active traffic on 8 RQs at 100Gbps port speed.
> > > > > > > > >
> > > > > > > > > Sounds good can you please post a bit more detail?
> > > > > > > > > which configs are you comparing what was the result on
> > > > > > > > > each of
> > > them.
> > > > > > > >
> > > > > > > > Common config: 8+8 tx and rx queues.
> > > > > > > > Port speed: 100Gbps
> > > > > > > > QEMU 8.1
> > > > > > > > Libvirt 7.0
> > > > > > > > GVM: Centos 7.4
> > > > > > > > Device: virtio VF hardware device
> > > > > > > >
> > > > > > > > Config_1: virtio suspend/resume similar to what Lingshan
> > > > > > > > has, largely vdpa stack
> > > > > > > > Config_2: Device context method of admin commands
> > > > > > >
> > > > > > > OK that sounds good. The weird thing here is that you
> > > > > > > measure
> > > "downtime".
> > > > > > > What exactly do you mean here?
> > > > > > > I am guessing it's the time to retrieve on source and
> > > > > > > re-program device state on destination? And this is 3.71x out of
> how long?
> > > > > > Yes. Downtime is the time during which the VM is not
> > > > > > responding or receiving
> > > > > packets, which involves reprogramming the device.
> > > > > > 3.71x is relative time for this discussion.
> > > > >
> > > > > Oh interesting. So VM state movement including reprogramming the
> > > > > CPU is dominated by reprogramming this single NIC, by a factor of
> almost 4?
> > > > Yes.
> > >
> > > Could you post some numbers too then?  I want to know whether that
> > > would imply that VM boot is slowed down significantly too. If yes
> > > that's another motivation for pci transport 2.0.
> > It was 1.8 sec down to 480msec.
> 
> Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
> 
> Eugenio or Si-wei may share an exact number, but it should be several
> hundreds of ms.
> 
Shadow vq is not applicable at all as a comparison point, because no virtio-specific QEMU or similar software is involved here.

Anyway, the requested numbers have been supplied for the device context based migration over the admin vq proposed here.
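
For reference, the two figures quoted in this sub-thread are consistent with
each other (a rough cross-check, assuming they describe the same test runs):
1.8 sec / 480 msec is roughly 3.75, i.e. of the same order as the ~3.71x
downtime reduction reported earlier.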


> But it seems the shadow virtqueue itself is not the major factor but the time
> spent on programming vendor specific mappings for example.
> 
> Thanks
> 
> > The time didn't come from pci side or boot side.
> >
> > For pci side of things you would want to compare the pci vs non pci device
> based VM boot time.
> >


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21  7:14                               ` Jason Wang
@ 2023-11-21 16:31                                 ` Parav Pandit
  2023-11-22  4:28                                   ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-21 16:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 12:45 PM
> 
> On Thu, Nov 16, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, November 16, 2023 9:54 AM
> > >
> > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, November 13, 2023 9:07 AM
> > > > >
> > > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Tuesday, November 7, 2023 9:34 AM
> > > > > > >
> > > > > > > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Monday, November 6, 2023 12:04 PM
> > > > > > > > >
> > > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit
> > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > During a device migration flow (typically
> > > > > > > > > > > > > > > > in a precopy phase of the live migration),
> > > > > > > > > > > > > > > > a device may write to the guest memory.
> > > > > > > > > > > > > > > > Some iommu/hypervisor may not be able to
> > > > > > > > > > > > > > > > track these
> > > > > > > > > > > written pages.
> > > > > > > > > > > > > > > > These pages to be migrated from source to
> > > > > > > > > > > > > > > > destination
> > > > > > > hypervisor.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > A device which writes to these pages,
> > > > > > > > > > > > > > > > provides the page address record of the to the owner
> device.
> > > > > > > > > > > > > > > > The owner device starts write recording
> > > > > > > > > > > > > > > > for the device and queries all the page
> > > > > > > > > > > > > > > > addresses written by the
> > > > > device.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Fixes:
> > > > > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/i
> > > > > > > > > > > > > > > > ssue
> > > > > > > > > > > > > > > > s/17
> > > > > > > > > > > > > > > > 6
> > > > > > > > > > > > > > > > Signed-off-by: Parav Pandit
> > > > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > > > > > > > Signed-off-by: Satananda Burla
> > > > > > > > > > > > > > > > <sburla@marvell.com>
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > changelog:
> > > > > > > > > > > > > > > > v1->v2:
> > > > > > > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > >  admin-cmds-device-migration.tex | 15
> > > > > > > > > > > > > > > > +++++++++++++++
> > > > > > > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > b/admin-cmds-device-migration.tex index
> > > > > > > > > > > > > > > > ed911e4..2e32f2c
> > > > > > > > > > > > > > > > 100644
> > > > > > > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > > > > > > > > Migration}\label{sec:Basic Facilities of a
> > > > > > > > > > > > > > > > Virtio Device / The owner driver can
> > > > > > > > > > > > > > > > discard any partially read or written
> > > > > > > > > > > > > > > > device context when  any of the device
> > > > > > > > > > > > > > > > migration flow
> > > > > > > > > > > > > > > should be aborted.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +During the device migration flow, a
> > > > > > > > > > > > > > > > +passthrough device may write data to the
> > > > > > > > > > > > > > > > +guest virtual machine's memory, a source
> > > > > > > > > > > > > > > > +hypervisor needs to keep track of these
> > > > > > > > > > > > > > > > +written memory to migrate such memory to
> > > > > > > > > > > > > > > > +destination
> > > > > > > > > > > > > > > hypervisor.
> > > > > > > > > > > > > > > > +Some systems may not be able to keep
> > > > > > > > > > > > > > > > +track of such memory write addresses at hypervisor
> level.
> > > > > > > > > > > > > > > > +In such a scenario, a device records and
> > > > > > > > > > > > > > > > +reports these written memory addresses to
> > > > > > > > > > > > > > > > +the owner device. The owner driver
> > > > > > > > > > > > > > > > +enables write recording for one or more
> > > > > > > > > > > > > > > > +physical address ranges per device during
> > > > > > > > > > > > > > > > +device
> > > > > > > > > > > migration flow.
> > > > > > > > > > > > > > > > +The owner driver periodically queries
> > > > > > > > > > > > > > > > +these written physical address
> > > > > > > > > > > > > records from the device.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I wonder how PA works in this case. Device
> > > > > > > > > > > > > > > uses untranslated requests so it can only see IOVA.
> > > > > > > > > > > > > > > We can't mandate
> > > > > > > ATS anyhow.
> > > > > > > > > > > > > > Michael suggested to keep the language uniform
> > > > > > > > > > > > > > as PA as this is ultimately
> > > > > > > > > > > > > what the guest driver is supplying during vq
> > > > > > > > > > > > > creation and in posting buffers as physical address.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This seems to need some work. And, can you show
> > > > > > > > > > > > > me how it can
> > > > > > > work?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor
> > > > > > > > > > > > > expected to do a bisection of the whole range?
> > > > > > > > > > > > > 2) does the device need to reserve sufficient
> > > > > > > > > > > > > internal resources for logging the dirty page and why
> (not)?
> > > > > > > > > > > > No when dirty page logging starts, only at that
> > > > > > > > > > > > time, device will reserve
> > > > > > > > > > > enough resources.
> > > > > > > > > > >
> > > > > > > > > > > GAW is 48bit, how large would it have then?
> > > > > > > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > > > > > > It is function of address ranges for the amount of
> > > > > > > > > > guest memory regardless of
> > > > > > > > > GAW.
> > > > > > > > >
> > > > > > > > > The problem is, e.g when vIOMMU is enabled, you can't
> > > > > > > > > know which IOVA is actually used by guests. And even for
> > > > > > > > > the case when vIOMMU is not enabled, the guest may have
> several TBs.
> > > > > > > > > Is it easy to reserve sufficient resources by the device itself?
> > > > > > > > >
> > > > > > > > When page tracking is enabled per device, it knows about
> > > > > > > > the range and it can
> > > > > > > reserve certain resource.
> > > > > > >
> > > > > > > I didn't see such an interface in this series. Anything I miss?
> > > > > > >
> > > > > > Yes, this patch and the next patch is covering the page
> > > > > > tracking start,stop and
> > > > > query commands.
> > > > > > They are named as write recording commands.
> > > > >
> > > > > So I still don't see how the device can reserve sufficient resources?
> > > > > Guests may map a very large area of memory to IOMMU (or when
> > > > > vIOMMU is disabled, GPA is used). It would be several TBs, how
> > > > > can the device reserve sufficient resources in this case?
> > > > When the map is established, the ranges are supplied to the device
> > > > to know
> > > how much to reserve.
> > > > If device does not have enough resource, it fails the command.
> > > >
> > > > One can advance it further to provision for the desired range..
> > >
> > > Well, I think I've asked whether or not a bisection is needed, and
> > > you told me not ...
> > >
> > > But at least we need to document this in the proposal, no?
> > >
> > We should expose a limit of the device in the proposed
> WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > So that future provisioning framework can use it.
> >
> > I will cover this in v5 early next week.
> >
> > > > >
> > > > > >
> > > > > > > Btw, the IOVA is allocated by the guest actually, how can we
> > > > > > > know the
> > > > > range?
> > > > > > > (or using the host range?)
> > > > > > >
> > > > > > Hypervisor would have mapping translation.
> > > > >
> > > > > That's really tricky and can only work in some cases:
> > > > >
> > > > > 1) It requires the hypervisor to traverse the guest I/O page
> > > > > tables which could be very large range
> > > > > 2) It requests the hypervisor to trap the modification of guest
> > > > > I/O page tables and synchronize with the range changes, which is
> > > > > inefficient and can only be done when we are doing shadow PTEs.
> > > > > It won't work when the nesting translation could be offloaded to
> > > > > the hardware
> > > > > 3) It is racy with the guest modification of I/O page tables
> > > > > which is explained in another thread
> > > > Mapping changes with more hw mmu's is not a frequent event and
> > > > IOTLB
> > > flush is done using querying the dirty log for the smaller range.
> > > >
> > > > > 4) No aware of new features like PASID which has been explained
> > > > > in another thread
> > > > For all the pinned work with non sw based IOMMU, it is typically small
> subset.
> > > > PASID is guest controlled.
> > >
> > > Let's repeat my points:
> > >
> > > 1) vq1 use untranslated request with PASID1
> > > 2) vq2 use untranslated request with PASID2
> > >
> > > Shouldn't we log PASID as well?
> > >
> > Possibly yes, either to request the tracking per PASID or to log the PASID.
> > When in future PASID based VQ are supported, this part should be
> extended.
> 
> Who is going to do the extension? They are orthogonal features for sure.
Whoever extends the VQ for PASID programming.

I plan to have a generic command for VQ creation over the CVQ for the wider use cases we discussed.
It can take a PASID parameter in the future when one wants to add it; a purely illustrative sketch follows below.
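
As a sketch only (none of this is defined anywhere yet; all field names are
invented for illustration):

/* Hypothetical layout of a generic VQ create command over the CVQ */
struct virtio_cvq_vq_create {
	le16 vq_index;
	le16 vq_size;
	le32 flags;            /* e.g. bit 0: the pasid field below is valid */
	le64 desc_area_addr;
	le64 driver_area_addr;
	le64 device_area_addr;
	le32 pasid;            /* PASID to tag the VQ's DMA with, if enabled */
	le32 reserved;
};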

> 
> >
> > > And
> > >
> > > 1) vq1 is using translated request
> > > 2) vq2 is using untranslated request
> > >
> 
> How about this?
How did the driver program the device so that vq1 uses translated requests and vq2 does not?
And for which use case?

> 
> >
> > > How could we differ?
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > Host should always have more resources than device, in
> > > > > > > > > that sense there could be several methods that tries to
> > > > > > > > > utilize host memory instead of the one in the device. I
> > > > > > > > > think we've discussed this when going through the doc prepared
> by Eugenio.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > What happens if we're trying to migrate more than 1 device?
> > > > > > > > > > >
> > > > > > > > > > That is perfectly fine.
> > > > > > > > > > Each device is updating its log of pages it wrote.
> > > > > > > > > > The hypervisor is collecting their sum.
> > > > > > > > >
> > > > > > > > > See above.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 3) DMA is part of the transport, it's natural to
> > > > > > > > > > > > > do logging there, why duplicate efforts in the virtio layer?
> > > > > > > > > > > > He he, you have funny comment.
> > > > > > > > > > > > When an abstract facility is added to virtio you
> > > > > > > > > > > > say to do in
> > > > > transport.
> > > > > > > > > > >
> > > > > > > > > > > So it's not done in the general facility but tied to the admin
> part.
> > > > > > > > > > > And we all know dirty page tracking is a challenge
> > > > > > > > > > > and Eugenio has a good summary of pros/cons. A
> > > > > > > > > > > revisit of those docs make me think virtio is not
> > > > > > > > > > > the good place for doing that for
> > > > > may reasons:
> > > > > > > > > > >
> > > > > > > > > > > 1) as stated, platform will evolve to be able to
> > > > > > > > > > > tracking dirty pages, actually, it has been
> > > > > > > > > > > supported by a lot of major IOMMU vendors
> > > > > > > > > >
> > > > > > > > > > This is optional facility in virtio.
> > > > > > > > > > Can you please point to the references? I don’t see it
> > > > > > > > > > in the common Linux
> > > > > > > > > kernel support for it.
> > > > > > > > >
> > > > > > > > > Note that when IOMMUFD is being proposed, dirty page
> > > > > > > > > tracking is one of the major considerations.
> > > > > > > > >
> > > > > > > > > This is one recent proposal:
> > > > > > > > >
> > > > > > > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > > > > > > >
> > > > > > > > Sure, so if platform supports it. it can be used from the platform.
> > > > > > > > If it does not, the device supplies it.
> > > > > > > >
> > > > > > > > > > Instead Linux kernel choose to extend to the devices.
> > > > > > > > >
> > > > > > > > > Well, as I stated, tracking dirty pages is challenging
> > > > > > > > > if you want to do it on a device, and you can't simply
> > > > > > > > > invent dirty page tracking for each type of the devices.
> > > > > > > > >
> > > > > > > > It is not invented.
> > > > > > > > It is generic framework for all virtio device types as proposed here.
> > > > > > > > Keep in mind, that it is optional already in v3 series.
> > > > > > > >
> > > > > > > > > > At least not seen to arrive this in any near term in
> > > > > > > > > > start of
> > > > > > > > > > 2024 which is
> > > > > > > > > where users must use this.
> > > > > > > > > >
> > > > > > > > > > > 2) you can't assume virtio is the only device that
> > > > > > > > > > > can be used by the guest, having dirty pages
> > > > > > > > > > > tracking to be implemented in each type of device is
> > > > > > > > > > > unrealistic
> > > > > > > > > > Of course, there is no such assumption made. Where did
> > > > > > > > > > you see a text that
> > > > > > > > > made such assumption?
> > > > > > > > >
> > > > > > > > > So what happens if you have a guest with virtio and
> > > > > > > > > other devices
> > > > > assigned?
> > > > > > > > >
> > > > > > > > What happens? Each device type would do its own dirty page
> tracking.
> > > > > > > > And if all devices does not have support, hypervisor knows
> > > > > > > > to fall back to
> > > > > > > platform iommu or its own.
> > > > > > > >
> > > > > > > > > > Each virtio and non virtio devices who wants to report
> > > > > > > > > > their dirty page report,
> > > > > > > > > will do their way.
> > > > > > > > > >
> > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > deprecated in the future for sure, as platform will
> > > > > > > > > > > provide much rich features for logging e.g it can do
> > > > > > > > > > > it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > to compete with the features that will be provided
> > > > > > > > > > > by the platform
> > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > virtio tc with timelines
> > > > > > > > > so that virtio TC can omit?
> > > > > > > > >
> > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > Virtio needs to be built on top of transport or
> > > > > > > > > platform. There's no need to duplicate
> > > > > > > their job.
> > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > >
> > > > > > > > I wanted to see a strong commitment for the cpu vendors to
> > > > > > > > support dirty
> > > > > > > page tracking.
> > > > > > >
> > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel,
> > > > > > > AMD and ARM are all supporting that now.
> > > > > > >
> > > > > > > > And the work seems to have started for some platforms.
> > > > > > >
> > > > > > > Let me quote from the above link:
> > > > > > >
> > > > > > > """
> > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > """
> > > > > > >
> > > > > > > > Without such platform commitment, virtio also skipping it
> > > > > > > > would not
> > > work.
> > > > > > >
> > > > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > > > vtd, the hw feature has been there for years.
> > > > > > >
> > > > > > Vtd has a sticky D bit that requires synchronization with
> > > > > > IOPTE page caches
> > > > > when sw wants to clear it.
> > > > >
> > > > > This is by design.
> > > > >
> > > > > > Do you know if is it reliable when device does multiple
> > > > > > writes, ie,
> > > > > >
> > > > > > a. iommu write D bit
> > > > > > b. software read it
> > > > > > c. sw synchronize cache
> > > > > > d. iommu write D bit on next write by device
> > > > >
> > > > > What issue did you see here? But that's not even an excuse, if
> > > > > there's a bug, let's report it to IOMMU vendors and let them fix it.
> > > > > The thread I point to you is actually a good space.
> > > > >
> > > > So we cannot claim that it is there in the platform.
> > >
> > > I'm confused, the thread I point to you did the cache
> > > synchronization which has been explained in the changelog, so what's the
> issue?
> > >
> > If the ask is for IOMMU chip to fix something, we cannot claim that dirty
> page tracking is available already in platform.
> 
> Again, can you describe the issue? Why do you think the sticky part is an
> issue? IOTLB needs to be sync with IO page tables, what's wrong with this?
Nothing is wrong with it.
The text just does not affirmatively state that it works once the sw clears the bit.

> 
> >
> > > >
> > > > > Again, the point is to let the correct role play.
> > > > >
> > > > How many more years should we block the virtio device migration
> > > > when
> > > platform do not have it?
> > >
> > > At least for VT-D, it has been used for years.
> > Is this device written pages tracked by KVM for VT-d as dirty page log,
> instead through vfio?
> 
> I don't get this question.
You said VT-d has had dirty page tracking for years, so it must be used by the sw during device migration.
And if that is there, how are these IOMMU dirty pages merged with the CPU-side dirty log?
Is this done by KVM for passthrough devices via vfio?

> 
> >
> > >
> > > >
> > > > > >
> > > > > > ARM SMMU based servers to be present with D bit tracking.
> > > > > > It is still early to say platform is ready.
> > > > >
> > > > > This is not what I read from both the series I posted and the
> > > > > spec, dirty bit has been supported several years ago at least for vtd.
> > > > Supported, but spec listed it as sticky bit that may require special
> handling.
> > >
> > > Please explain why this is "special handling". IOMMU has several
> > > different layers of caching, by design, it can't just open a window for D bit.
> > >
> > > > May be it is working, but not all cpu platforms have it.
> > >
> > > I don't see the point. Migration is not supported for virito as well.
> > >
> > I don’t see a point either to discuss.
> >
> > I already acked that platform may have support as well, and not all platform
> has it.
> > So the device feeds the data and its platform's choice to enable/disable.
> 
> I've pointed out sufficient issues and I don't want to repeat them.
There does not seem to be any issue that is critical enough for the non-vIOMMU case.
A vIOMMU needs to flush the IOTLB anyway.

> 
> >
> > > >
> > > > >
> > > > > >
> > > > > > It is optional so whichever has the support it will be used.
> > > > >
> > > > > I can't see the point of this, it is already available. And
> > > > > migration doesn't exist in virtio spec yet.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > >
> > > > > > > > > Why does it matter in 2024?
> > > > > > > > Because users needs to use it now.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > If not, we are better off to offer this, and when/if
> > > > > > > > > > platform support is, sure,
> > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > >
> > > > > > > > > > > 4) if the platform support is missing, we can use
> > > > > > > > > > > software or leverage transport for assistance like
> > > > > > > > > > > PRI
> > > > > > > > > > All of these are in theory.
> > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > than page fault rate
> > > > > > > > > done by the cpu.
> > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > >
> > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > Do you have perf data for this?
> > > > > > >
> > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > small program that dirty every page by a NIC.
> > > > > > >
> > > > > > > > In the internal tests we don’t see this happening.
> > > > > > >
> > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > >
> > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > NIC), we can't satisfy the requirement of the downtime. Or
> > > > > > > if you see the converge, you might get help from the auto
> > > > > > > converge support by the hypervisors like KVM where it tries
> > > > > > > to throttle the VCPU then you can't reach
> > > > > the wire speed.
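As a rough illustration of this formula (the dirty rate and link speed below are assumed example values, not numbers from this thread):

/* Rough illustration of downtime = dirty_rate * PAGE_SIZE / migration_speed.
 * The dirty rate and link speed are assumptions for illustration only.
 */
#include <stdio.h>

int main(void)
{
        const double page_size       = 4096.0;   /* bytes per guest page */
        const double dirty_rate      = 2.0e6;    /* pages/s dirtied by the NIC (assumed) */
        const double migration_speed = 12.5e9;   /* bytes/s, roughly a 100 Gbps link (assumed) */

        /* Time to re-send the pages dirtied during one second of guest
         * run time; convergence requires this ratio to stay well below 1,
         * and the final stop-and-copy pass is on the same order. */
        double downtime = dirty_rate * page_size / migration_speed;

        printf("estimated downtime: %.2f s\n", downtime);  /* ~0.66 s here */
        return 0;
}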
> > > > > > >
> > > > > > Once PRI is enabled, even without migration, there is basic perf issues.
> > > > >
> > > > > The context is not PRI here...
> > > > >
> > > > > It's about if you can stick to wire speed during live migration.
> > > > > Based on the analysis so far, you can't achieve wirespeed and
> > > > > downtime at
> > > the same time.
> > > > > That's why the hypervisor needs to throttle VCPU or devices.
> > > > >
> > > > So?
> > > > Device also may throttle itself.
> > >
> > > That's perfectly fine. We are on the same page, no? It's wrong to
> > > judge the dirty page tracking in the context of live migration by
> > > measuring whether or not the device can work at wire speed.
> > >
> > > >
> > > > > For PRI, it really depends on how you want to use it. E.g if you
> > > > > don't want to pin a page, the performance is the price you must pay.
> > > > PRI without pinning does not make sense for device to make large
> > > > mapping
> > > queries.
> > >
> > > That's also fine. Hypervisors can choose to enable and use PRI
> > > depending on the different cases.
> > >
> > So PRI is not must for device migration.
> 
> I never say it's a must.
> 
> > Device migration must be able to work without PRI enabled, as simple as
> that as first base line.
> 
> My point is that, you need document
> 
> 1) why you think dirty page is a must or not
This is already explained in the patch commit log and in the spec's theory of operation.

> 2) why did you choose one of a specific way instead of others
> 
This is not part of the spec anyway; it has already been discussed on the mailing list here in the community.

> >
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > So it is unusable.
> > > > > > > > >
> > > > > > > > > It's not about mandating, it's about doing things in the
> > > > > > > > > correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > You should try.
> > > > > > >
> > > > > > > Not my duty, I just want to make sure things are done in the
> > > > > > > correct layer, and once it needs to be done in the virtio,
> > > > > > > there's nothing obviously
> > > > > wrong.
> > > > > > >
> > > > > > At present, it looks all platforms are not equally ready for page
> tracking.
> > > > >
> > > > > That's not an excuse to let virtio support that.
> > > > It is wrong attribution as excuse.
> > > >
> > > > > And we need also to figure out if virtio can do that easily.
> > > > > I've pointed out sufficient issues, I'm pretty sure there would
> > > > > be more as the platform evolves.
> > > > >
> > > > I am not sure if virtio feeds the log into the platform.
> > >
> > > I don't understand the meaning here.
> > >
> > I mistakenly merged two sentences.
> >
> > Virtio feeds the dirty page details to the hypervisor platform which collects
> and merges the page record.
> > So it is platform choice to use iommu based tracking or device based.
> >
> > > >
> > > > > >
> > > > > > > > In the current state, it is mandating.
> > > > > > > > And if you think PRI is the only way,
> > > > > > >
> > > > > > > I don't, it's just an example where virtio can leverage from
> > > > > > > either transport or platform. Or if it's the fault in virtio
> > > > > > > that slows down the PRI, then it is something we can do.
> > > > > > >
> > > > > > Yea, it does not seem to be ready yet.
> > > > > >
> > > > > > > >  than you should propose that in the dirty page tracking
> > > > > > > > series that you listed
> > > > > > > above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > >
> > > > > > > No, the point is to not duplicate works especially
> > > > > > > considering virtio can't do better than platform or transport.
> > > > > > >
> > > > > > Both the platform and virtio work is ongoing.
> > > > >
> > > > > Why duplicate the work then?
> > > > >
> > > > Not all cpu platforms support as far as I know.
> > >
> > > Yes, but we all know the platform is working to support this.
> > >
> > > Supporting this on the device is hard.
> > >
> > This is optional, whichever device would like to implement it, will support it.
> >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > When one does something in transport, you say,
> > > > > > > > > > > > this is transport specific, do
> > > > > > > > > > > some generic.
> > > > > > > > > > > >
> > > > > > > > > > > > Here the device is being tracked is virtio device.
> > > > > > > > > > > > PCI-SIG has told already that PCIM interface is
> > > > > > > > > > > > outside the scope of
> > > > > it.
> > > > > > > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > > > > > > >
> > > > > > > > > > > You will end up with a competition with the
> > > > > > > > > > > platform/transport one that will fail.
> > > > > > > > > > >
> > > > > > > > > > I don’t see a reason. There is no competition.
> > > > > > > > > > Platform always have a choice to not use device side
> > > > > > > > > > page tracking when it is
> > > > > > > > > supported.
> > > > > > > > >
> > > > > > > > > Platform provides a lot of other functionalities for dirty logging:
> > > > > > > > > e.g per PASID, granular, etc. So you want to duplicate
> > > > > > > > > them again in the virtio? If not, why choose this way?
> > > > > > > > >
> > > > > > > > It is optional for the platforms where platform do not have it.
> > > > > > >
> > > > > > > We are developing new virtio functionalities that are
> > > > > > > targeted for future platforms. Otherwise we would end up
> > > > > > > with a feature with a very narrow use case.
> > > > > > In general I agree that platform is an option too.
> > > > > > Hypervisor will be able to make the decision to use platform
> > > > > > when available
> > > > > and fallback to device method when platform does not have it.
> > > > > >
> > > > > > Future and to be equally usable in near term :)
> > > > >
> > > > > Please don't double standard again:
> > > > >
> > > > > When you are talking about TDISP, you want virtio to be designed
> > > > > to fit for the future where the platform is ready in the future
> > > > > When you are talking about dirty tracking, you want it to work
> > > > > now even if
> > > > >
> > > > The proposal of transport VQ is anti-TDISP.
> > >
> > > It's nothing about transport VQ, it's about you're saying the adminq
> > > based device context. There's a comment to point out that the
> > > current TDISP spec forbids modifying device state when TVM is
> > > attached. Then you told us the TDISP may evolve for that.
> > So? That is not double standard.
> > The proposal is based on main principle that it is not depending on
> > hypervisor traping + emulating which is the baseline of TDISP
> >
> > >
> > > > The proposal of dirty tracking is not anti-platform. It is
> > > > optional like rest of the
> > > platform.
> > > >
> > > > > 1) most of the platform is ready now
> > > > Can you list a ARM server CPU in production that has it? (not in
> > > > some pdf
> > > spec).
> > >
> > > Then in the context of a dirty page, I've proved you dirty page
> > > tracking has been supported by all major vendors.
> > Major IP vendor != major cpu chip vendor.
> > I don’t agree with the proof.
> 
> So this will be an endless debate. Did I ever ask you about ETA or any product
> for TDISP?
> 
The ETA for TDISP is not relevant here.
You claimed _major_ vendor support based on an IP vendor rather than a physical CPU vendor, hence the disagreement.
And that is not the reality.

> >
> > I already acknowledged that I have seen internal test report for dirty tracking
> with one cpu and nic.
> >
> > I just don’t see all cpus have support for it.
> > Hence, this optional feature.
> 
> Repeat myself again.
> 
> If it can be done easily and efficiently in virtio, I agree. But I've pointed out
> several issues where it is not answered.

I have answered most of your questions.

The definition of 'easy' is very subjective.
At one point RSS was also not easy for some devices, and IOMMU dirty page tracking was not easy either.

> 
> >
> > > Where you refuse to use the standard you used in explaining adminq
> > > for device context in TDISP.
> > >
> > > So I didn't ask you the ETA of the TDISP support for migration or
> > > adminq, but you want me to give you the production information which is
> pointless.
> > Because you keep claiming that _all_ cpus in the world has support for
> efficient dirty page tracking.
> >
> > > You
> > > might need to ask ARM to get an answer, but a simple google told me
> > > the effort to support dirty page tracking in SMMUv3 could go back to early
> 2021.
> > >
> > To my knowledge ARM do not produce physical chips.
> > Your proposal is to keep those ARM server vendors to not use virtio devices.
> 
> This arbitrary conclusion makes no sense.
> 
Your conclusion that "all" or "major" physical CPU vendors support dirty page tracking is equally arbitrary.
So it is better not to argue about this.

> I know at least one cloud vendor has used a virtio based device for years on
> ARM. And that vendor has posted patches to support dirty page tracking since
> 2020.
> 
> Thanks
> 
> > Does not make sense to me.
> >
> > > https://lore.kernel.org/linux-iommu/56b001fa-b4fe-c595-dc5e-
> > > f362d2f07a19@linux.intel.com/t/
> > >
> > > Why is it not merged? It's simply because we agree to do it in the
> > > layer of IOMMUFD so it needs to wait.
> > >
> > > Thanks
> > >
> > >
> > > >
> > > > > 2) whether or not virtio can log dirty page correctly is still
> > > > > suspicious
> > > > >
> > > > > Thanks
> > > >
> > > > There is no double standard. The feature is optional which
> > > > co-exists as
> > > explained above.
> >


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21 16:29                                                                           ` Parav Pandit
@ 2023-11-21 21:00                                                                             ` Michael S. Tsirkin
  2023-11-22  3:46                                                                               ` Parav Pandit
  2023-11-22  4:17                                                                             ` Jason Wang
  1 sibling, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-21 21:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Tue, Nov 21, 2023 at 04:29:36PM +0000, Parav Pandit wrote:
> Basic test with iperf is not working. Crashing it.
> All of this is complete unrelated discussion to this series to slow down the work.
> I don’t see any value.
> Michael asked to do the test, we did, it does not work. Functionally broken code has no comparison.

It's unfortunate that it's unstable for you; if you could show a perf
comparison, that would be a strong argument for your case. Reporting
Linux/qemu failures to the virtio TC is not going to help you though,
that's the wrong forum.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21  6:55                                                             ` Jason Wang
  2023-11-21 16:30                                                               ` Parav Pandit
@ 2023-11-22  2:31                                                               ` Si-Wei Liu
  2023-11-22  5:31                                                                 ` Jason Wang
  1 sibling, 1 reply; 157+ messages in thread
From: Si-Wei Liu @ 2023-11-22  2:31 UTC (permalink / raw)
  To: Jason Wang, Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan, virtio-comment, cohuck,
	sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas, eperezma,
	si-wei.liu

(dropping my personal email address, which I no longer use for upstream 
discussion; please copy my corporate email address for a more timely response)

On 11/20/2023 10:55 PM, Jason Wang wrote:
> On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
>>
>>> From: Michael S. Tsirkin <mst@redhat.com>
>>> Sent: Friday, November 17, 2023 7:31 PM
>>> To: Parav Pandit <parav@nvidia.com>
>>>
>>> On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
>>>>
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Sent: Friday, November 17, 2023 6:02 PM
>>>>>
>>>>> On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
>>>>>>
>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Sent: Friday, November 17, 2023 5:35 PM
>>>>>>> To: Parav Pandit <parav@nvidia.com>
>>>>>>>
>>>>>>> On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>> Sent: Friday, November 17, 2023 5:04 PM
>>>>>>>>>
>>>>>>>>> On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
>>>>>>>>>>
>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>> Sent: Friday, November 17, 2023 4:30 PM
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>> Sent: Friday, November 17, 2023 3:30 PM
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
>>>>>>>>>>>>>> Lingshan
>>>>> wrote:
>>>>>>>>>>>>>>> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
>>>>>>>>>>>>>>>> Pandit
>>>>> wrote:
>>>>>>>>>>>>>>>>> We should expose a limit of the device in the
>>>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>> WRITE_RECORD_CAP_QUERY command, that how much
>>> range
>>>>>>>>>>>>> it can
>>>>>>>>> track.
>>>>>>>>>>>>>>>>> So that future provisioning framework can use it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I will cover this in v5 early next week.
>>>>>>>>>>>>>>>> I do worry about how this can even work though.
>>>>>>>>>>>>>>>> If you want a generic device you do not get to
>>>>>>>>>>>>>>>> dictate how much memory VM
>>>>>>>>> has.
>>>>>>>>>>>>>>>> Aren't we talking bit per page? With 1TByte of
>>>>>>>>>>>>>>>> memory to track
>>>>>>>>>>>>>>>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And you happily say "we'll address this in the future"
>>>>>>>>>>>>>>>> while at the same time fighting tooth and nail
>>>>>>>>>>>>>>>> against adding single bit status registers because
>>> scalability?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a feeling doing this completely
>>>>>>>>>>>>>>>> theoretical like this is
>>>>>>>>> problematic.
>>>>>>>>>>>>>>>> Maybe you have it all laid out neatly in your
>>>>>>>>>>>>>>>> head but I suspect not all of TC can picture it
>>>>>>>>>>>>>>>> clearly enough based just on spec
>>>>>>>>> text.
>>>>>>>>>>>>>>>> We do sometimes ask for POC implementation in
>>>>>>>>>>>>>>>> linux / qemu to demonstrate how things work
>>>>>>>>>>>>>>>> before merging
>>>>> code.
>>>>>>>>>>>>>>>> We skipped this for admin things so far but I
>>>>>>>>>>>>>>>> think it's a good idea to start doing it here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What makes me pause a bit before saying please
>>>>>>>>>>>>>>>> do a PoC is all the opposition that seems to
>>>>>>>>>>>>>>>> exist to even using admin commands in the 1st
>>>>>>>>>>>>>>>> place. I think once we finally stop arguing
>>>>>>>>>>>>>>>> about whether to use admin commands at all then
>>>>>>>>>>>>>>>> a PoC will be needed
>>>>>>>>>>> before merging.
>>>>>>>>>>>>>>> We have POR productions that implemented the
>>>>>>>>>>>>>>> approach in my
>>>>>>>>> series.
>>>>>>>>>>>>>>> They are multiple generations of productions in
>>>>>>>>>>>>>>> market and running in customers data centers for years.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Back to 2019 when we start working on vDPA, we
>>>>>>>>>>>>>>> have sent some samples of production(e.g.,
>>>>>>>>>>>>>>> Cascade
>>>>>>>>>>>>>>> Glacier) and the datasheet, you can find live
>>>>>>>>>>>>>>> migration facilities there, includes suspend, vq
>>>>>>>>>>>>>>> state and other
>>>>> features.
>>>>>>>>>>>>>>> And there is an reference in DPDK live migration,
>>>>>>>>>>>>>>> I have provided this page
>>>>>>>>>>>>>>> before:
>>>>>>>>>>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.ht
>>>>>>>>>>>>>>> ml, it has been working for long long time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So if we let the facts speak, if we want to see
>>>>>>>>>>>>>>> if the proposal is proven to work, I would
>>>>>>>>>>>>>>> say: They are POR for years, customers already
>>>>>>>>>>>>>>> deployed them for
>>>>>>>>> years.
>>>>>>>>>>>>>> And I guess what you are trying to say is that
>>>>>>>>>>>>>> this patchset we are reviewing here should be help
>>>>>>>>>>>>>> to the same standard and there should be a PoC? Sounds
>>> reasonable.
>>>>>>>>>>>>> Yes and the in-marketing productions are POR, the
>>>>>>>>>>>>> series just improves the design, for example, our
>>>>>>>>>>>>> series also use registers to track vq state, but
>>>>>>>>>>>>> improvements than CG or BSC. So I think they are
>>>>>>>>>>>>> proven
>>>>>>>>>>> to work.
>>>>>>>>>>>> If you prefer to go the route of POR and production
>>>>>>>>>>>> and proven documents
>>>>>>>>>>> etc, there is ton of it of multiple types of products I
>>>>>>>>>>> can dump here with open- source code and documentation and
>>> more.
>>>>>>>>>>>> Let me know what you would like to see.
>>>>>>>>>>>>
>>>>>>>>>>>> Michael has requested some performance comparisons,
>>>>>>>>>>>> not all are ready to
>>>>>>>>>>> share yet.
>>>>>>>>>>>> Some are present that I will share in coming weeks.
>>>>>>>>>>>>
>>>>>>>>>>>> And all the vdpa dpdk you published does not have
>>>>>>>>>>>> basic CVQ support when I
>>>>>>>>>>> last looked at it.
>>>>>>>>>>>> Do you know when was it added?
>>>>>>>>>>> It's good enough for PoC I think, CVQ or not.
>>>>>>>>>>> The problem with CVQ generally, is that VDPA wants to
>>>>>>>>>>> shadow CVQ it at all times because it wants to decode
>>>>>>>>>>> and cache the content. But this problem has nothing to
>>>>>>>>>>> do with dirty tracking even though it also
>>>>>>>>> mentions "shadow":
>>>>>>>>>>> if device can report it's state then there's no need to shadow
>>> CVQ.
>>>>>>>>>> For the performance numbers with the pre-copy and device
>>>>>>>>>> context of
>>>>>>>>> patches posted 1 to 5, the downtime reduction of the VM is
>>>>>>>>> 3.71x with active traffic on 8 RQs at 100Gbps port speed.
>>>>>>>>>
>>>>>>>>> Sounds good can you please post a bit more detail?
>>>>>>>>> which configs are you comparing what was the result on each of
>>> them.
>>>>>>>> Common config: 8+8 tx and rx queues.
>>>>>>>> Port speed: 100Gbps
>>>>>>>> QEMU 8.1
>>>>>>>> Libvirt 7.0
>>>>>>>> GVM: Centos 7.4
>>>>>>>> Device: virtio VF hardware device
>>>>>>>>
>>>>>>>> Config_1: virtio suspend/resume similar to what Lingshan has,
>>>>>>>> largely vdpa stack
>>>>>>>> Config_2: Device context method of admin commands
>>>>>>> OK that sounds good. The weird thing here is that you measure
>>> "downtime".
>>>>>>> What exactly do you mean here?
>>>>>>> I am guessing it's the time to retrieve on source and re-program
>>>>>>> device state on destination? And this is 3.71x out of how long?
>>>>>> Yes. Downtime is the time during which the VM is not responding or
>>>>>> receiving
>>>>> packets, which involves reprogramming the device.
>>>>>> 3.71x is relative time for this discussion.
>>>>> Oh interesting. So VM state movement including reprogramming the CPU
>>>>> is dominated by reprogramming this single NIC, by a factor of almost 4?
>>>> Yes.
>>> Could you post some numbers too then?  I want to know whether that would
>>> imply that VM boot is slowed down significantly too. If yes that's another
>>> motivation for pci transport 2.0.
>> It was 1.8 sec down to 480msec.
> Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
>
> Eugenio or Si-wei may share an exact number, but it should be several
> hundreds of ms.
That was mostly the device teardown time at the source, but there is 
also setup cost at the destination that needs to be counted.
Several hundred milliseconds would be the ultimate goal, I would say 
(right now the numbers from Parav more or less reflect the status quo, 
but there is ongoing work to bring it further down), and I don't doubt 
several hundred ms is possible. But to be fair, on the other hand, 
shadow vq on a real vdpa hardware device would need a lot of dedicated 
optimization work across all layers (including hardware or firmware) all 
over the place to achieve what a simple suspend-resume (save/load) 
interface can easily do with VFIO migration.

> But it seems the shadow virtqueue itself is not the major factor but
> the time spent on programming vendor specific mappings for example.
Yep. The slowness in the mapping part is mostly an artifact of the 
software-based implementation. IMHO, from a live migration p.o.v., it's 
better not to involve any mapping operation in the downtime path at all.

-Siwei
>
> Thanks
>
>> The time didn't come from pci side or boot side.
>>
>> For pci side of things you would want to compare the pci vs non pci device based VM boot time.
>>
>


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21 21:00                                                                             ` Michael S. Tsirkin
@ 2023-11-22  3:46                                                                               ` Parav Pandit
  2023-11-22  7:44                                                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-22  3:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 22, 2023 2:31 AM
> 
> On Tue, Nov 21, 2023 at 04:29:36PM +0000, Parav Pandit wrote:
> > Basic test with iperf is not working. Crashing it.
> > All of this is complete unrelated discussion to this series to slow down the
> work.
> > I don’t see any value.
> > Michael asked to do the test, we did, it does not work. Functionally broken
> code has no comparison.
> 
> It's unfortunate it's unstable for you, if you could show perf comparison that
> would be a strong argument for your case. Reporting Linux/qemu failures to
> virtio TC is not going to help you though, wrong forum.

As I explained, the basic requirements are not met; hence the comparison is not applicable.
There is no point in discussing a specific OS implementation anyway. You asked us to remove the vfio citations, hence we removed the other citations as well.

Thanks.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21 16:24                                           ` [virtio-comment] " Parav Pandit
@ 2023-11-22  4:11                                             ` Jason Wang
  0 siblings, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-22  4:11 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 22, 2023 at 12:25 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 9:52 AM
> >
> > On Thu, Nov 16, 2023 at 2:49 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > >
> > > On Thu, Nov 16, 2023 at 12:24:27PM +0800, Jason Wang wrote:
> > > > On Thu, Nov 16, 2023 at 1:37 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, November 13, 2023 9:11 AM
> > > > > >
> > > > > > On Fri, Nov 10, 2023 at 2:46 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > > Hi Michael,
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Thursday, November 9, 2023 1:29 PM
> > > > > > >
> > > > > > > [..]
> > > > > > > > > Besides the issue of performance, it's also racy, assuming
> > > > > > > > > we are logging
> > > > > > > > IOVA.
> > > > > > > > >
> > > > > > > > > 0) device log IOVA
> > > > > > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > > > > > 2) guest map IOVA to a new GPA
> > > > > > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > > > > > >
> > > > > > > > > Then we lost the old GPA.
> > > > > > > >
> > > > > > > > Interesting and a good point. And by the way e.g. vhost has
> > > > > > > > the same issue.  You need to flush dirty tracking info when
> > > > > > > > changing the mappings somehow.  Parav what's the plan for
> > > > > > > > this? Should be addressed in
> > > > > > the spec too.
> > > > > > > >
> > > > > > > As you listed the flush is needed for vhost or device-based DPT.
> > > > > >
> > > > > > What does DPT mean? Device Page Table? Let's not invent
> > > > > > terminology which is not known by others please.
> > > > > >
> > > > > Sorry for using the acronym. I meant dirty page tracking.
> > > > >
> > > > > > We have discussed it many times. You can't just depend on ATS or
> > > > > > reinventing wheels in virtio.
> > > > > The dependency is on the iommu which would have the mapping of
> > GIOVA to GPA like any sw implementation.
> > > > > No dependency on ATS.
> > > > >
> > > > > >
> > > > > > What's more, please try not to give me the impression that the
> > > > > > proposal is optimized for a specific vendor (like device IOMMU stuffs).
> > > > > >
> > > > > You should stop calling this specific vendor thing.
> > > >
> > > > Well, as you have explained, the confusion came from "DPT" ...
> > > >
> > > > > One can equally say that suspend bit proposal is for the sw_vendor
> > device who is forcing virtio hw device to only implement ioqueues + PASID +
> > non_unified interface for PF, VF, SIOVs + non_TDISP based devices.
> > > > >
> > > > > > > The necessary plumbing is already covered for this in the
> > > > > > > query (read and
> > > > > > clear) command of this v3 proposal.
> > > > > >
> > > > > > The issue is logging via IOVA ... I don't see how "read and clear" can
> > help.
> > > > > >
> > > > > Read and clear helps that ensures that all the dirty pages are reported,
> > hence there is no mapping/unmapping race.
> > > >
> > > > Reported as IOVA ...
> > > >
> > > > > As everything is reported.
> > > > >
> > > > > > > It is listed in Device Write Records Read Command.
> > > > > >
> > > > > > Please explain how your proposal can solve the above race.
> > > > > >
> > > > > In below manner.
> > > > > 1. guest has GIOVA to GPA_1 mapping 2. RX packets occurred to
> > > > > GIOVA 3. device reported dirty page log for GIOVA (hypervisor is
> > > > > yet to read) 4. guest requested mapping change from GIOVA to GPA_2
> > > > > 4.1 During this IOTLB is invalidated and dirty page report is
> > > > > queried ensuring, it can change the mapping
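A minimal sketch of that ordering, written as hypothetical hypervisor code in C; the helper names are invented for illustration and do not refer to any existing interface:

#include <stdint.h>

struct vm;  /* opaque, hypothetical VM handle */

/* Hypothetical helpers, for illustration only. */
void iommu_invalidate_iotlb(struct vm *vm, uint64_t giova, uint64_t len);
void query_and_clear_device_write_records(struct vm *vm, uint64_t giova, uint64_t len);
void remove_iova_mapping(struct vm *vm, uint64_t giova, uint64_t len);

/* Ordering sketch for the hypervisor's unmap path described above. */
void handle_guest_unmap(struct vm *vm, uint64_t giova, uint64_t len)
{
        /* Stop further DMA translations for the range being remapped. */
        iommu_invalidate_iotlb(vm, giova, len);

        /* Read-and-clear the device write records while the old
         * GIOVA -> GPA_1 translation is still known, so writes through
         * the old mapping are accounted against GPA_1. */
        query_and_clear_device_write_records(vm, giova, len);

        /* Only then drop the old mapping; a subsequent map to GPA_2
         * starts clean for this range. */
        remove_iova_mapping(vm, giova, len);
}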
> > > >
> > > > It requires
> > > >
> > > > 1) hypervisor traps IOTLB invalidation, which doesn't work when
> > > > nesting could be offloaded (IOMMUFD has started the work to support
> > > > nesting)
> > > > 2) query the device about the dirty page on each IOTLB invalidation which:
> > > > 2.1) A huge round trip: guest IOTLB invalidation -> trapped by
> > > > hypervisor -> start the query from the device -> device return ->
> > > > hypervisor reports IOTLB invalidation is done -> let guest run. Have
> > > > you benchmarked the RTT in this case? There are just too many places
> > > > that cause the delay in the middle.
> > >
> > > To be fair invalidations are already expensive e.g. with vhost iotlb
> > > it requires a slow system call.
> > > This will make them *even more* expensive.
> >
> > Yes, a slow syscall plus a virtqueue query RTT.
> >
> Only during viommu case.

What's worse, modern IOMMU drivers tend to batch invalidations into a
per-domain invalidation. I can easily imagine how slow that would be.

For vhost, it's still one syscall. But your proposal needs to query all
of the possible IOVA ranges, which is horribly slow, as it takes several
query requests or PCI transactions, and guest lockups would not be rare.

> Without this is not applicable.

The list is the place to discuss possible issues, no?

>
> > Need some benchmark. It looks to me currently the invalidation is done via a
> > queued based interface in vtd. So guests may need to spin where it may trigger
> > a lockup in the guest.
> >
>
> > >
> > > Problem for some but not all workloads.  Again I agree motivation,
> > > tradeoffs and comparison with both dirty tracking by iommu and shadow
> > > vq approaches really should be included.
> >
> Dirty tracking is iommu to be considered.
> Shadow vq is not in my scope and it does not fit the basic requirements as explained before.

I don't see a good explanation other than "I hit a bug in the Linus tree,
it's unstable, so I won't test anymore".

Thanks



> So it is different discussion.
>
> > +1
> >
> > >
> > >
> > > > 2.2) Guest triggerable behaviour, malicious guest can simply do
> > > > endless IOTLB invalidation to DOS the e.g admin virtqueue
> > >
> > > I'm not sure how much to worry about it - just don't allow more than
> > > one in flight per VM.
> >
> > That's fine but it may need a note.
> >
> > Thanks
> >
> >
> > >
> > >
> > >
> > > > >
> > > > > > >
> > > > > > > When the page write record is fully read, it is flushed.
> > > > > > > How/when to use, I think its hypervisor specific, so we
> > > > > > > probably better off not
> > > > > > documenting those details.
> > > > > >
> > > > > > Well, as the author of this proposal, at least you need to know
> > > > > > how a hypervisor can work with your proposal, no?
> > > > > >
> > > > > Likely yes, but it is not the scope of the spec to list those paths etc.
> > > >
> > > > Fine, but as a reviewer I need to know if it can work with a hypervisor well.
> > > >
> > > > >
> > > > > > > May be such read is needed in some other path too depending on
> > > > > > > how
> > > > > > hypervisor implemented.
> > > > > >
> > > > > > What do you mean by "May be ... some other path" here? You're
> > > > > > inventing a mechanism that you don't know how a hypervisor can use?
> > > > >
> > > > > No. I meant hypervisor may have more operations that
> > map/unmap/flush where it may need to implement it.
> > > > > Some one may call it set_map(), some may say dma_map()...
> > > >
> > > > Ok.
> > > >
> > > > Thanks
> > >
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21 16:26                                                   ` [virtio-comment] " Parav Pandit
@ 2023-11-22  4:14                                                     ` Jason Wang
  2023-11-22  4:19                                                       ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-22  4:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 22, 2023 at 12:26 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 9:55 AM
> >
> > On Fri, Nov 17, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 16, 2023 11:51 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 05:29:49PM +0000, Parav Pandit wrote:
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 10:56 PM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Thursday, November 16, 2023 5:18 PM
> > > > > > > >
> > > > > > > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S. Tsirkin
> > wrote:
> > > > > > > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > > > We should expose a limit of the device in the
> > > > > > > > > > > > proposed
> > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much range it
> > > > > > > > > > can
> > > > > > track.
> > > > > > > > > > > > So that future provisioning framework can use it.
> > > > > > > > > > > >
> > > > > > > > > > > > I will cover this in v5 early next week.
> > > > > > > > > > >
> > > > > > > > > > > I do worry about how this can even work though. If you
> > > > > > > > > > > want a generic device you do not get to dictate how
> > > > > > > > > > > much memory
> > > > VM has.
> > > > > > > > > > >
> > > > > > > > > > > Aren't we talking bit per page? With 1TByte of memory
> > > > > > > > > > > to track
> > > > > > > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > >
> > > > > > > > > > Ugh. Actually of course:
> > > > > > > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit ->
> > > > > > > > > > 8Mbyte per VF
> > > > > > > > > >
> > > > > > > > > > 8Gbyte per *PF* with 1K VFs.
> > > > > > > > > >
> > > > > > > > > Device may not maintain as a bitmap.
> > > > > > > >
> > > > > > > > However you maintain it, there's 256Mega bit of information.
> > > > > > > There may be other data structures that device may deploy as
> > > > > > > for example
> > > > > > hash or tree or something else.
> > > > > >
> > > > > > Point being?
> > > > > The device may have some hashing accelerator or other improvements
> > > > > that
> > > > may perform better than bitmap as many queues in parallel attempt to
> > > > update the shared database.
> > > >
> > > > Maybe, I didn't give this thought.
> > > >
> > > > My point was that to be able to keep all combinations of dirty/non
> > > > dirty page for each 4k page in a 1TByte guest device needs 8MBytes
> > > > of on-device memory per VF. As designed the query also has to report
> > > > it for each VF accurately even if multiple VFs are accessing same guest.
> > > Yes.
> > >
> > > >
> > > > > >
> > > > > > > And this is runtime memory only during the short live
> > > > > > > migration period of
> > > > > > 400msec or less.
> > > > > > > It is not some _always_ resident memory.
> >
> > When developing the spec, we should not have any assumption for the
> > implementation. For example, you can't just assume virtio is always emulated
> > in the software in the DPU.
> >
> There is no such assumption.
> It is supported on non DPU devices too.

You mean, e.g., an 8MB on-chip resource per VF is good to go?

>
> > How can you make sure you can converge in 400ms without having a interface
> > for the driver to set the correct parameter like dirty rates?
>
> 400msec is also written anywhere as requirement if this is what you want to argue about.

No, the downtime needs to be coordinated with the hypervisor; that is
what I want to say. Unfortunately, I don't see any such interface in
this series.

> There is nothing prevents to extend the interface to define the SLA as additional commands in the future to improve the solution.
>
> There is no need to boil the ocean now. Once the base infrastructure is built, we will improve it further.
> And proposed patches are reasonably well covered to our knowledge.

Well, it is not me but you who claims it can be done in 400ms. I'm
wondering how, and you told me it could be done in the future?

Thanks


>
> >
> > Thanks
> >
> > > > > >
> > > > > > No - write tracking is used in the live phase of migration. It
> > > > > > can be enabled as long as you wish - it's a question of policy.
> > > > > > There actually exist solutions that utilize this phase for
> > > > > > redundancy, permanently
> > > > running in this mode.
> > > > >
> > > > > If such use case exists, one may further improve the device
> > implementation.
> > > >
> > > > Yes such use cases exist, there is no limit on how long migration takes.
> > > > So go ahead and further improve it please. Do not give us "we did
> > > > not get requests for this feature" please.
> > >
> > > Please describe the use case more precisely.
> > > If there is any application or OS API etc exists, please point to it where would
> > you like to fit this dirty page tracking beyond device migration.
> > > We may have to draw a line to have reasonable point and not keep
> > discussing infinitely.
> > >
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21 16:29                                                                           ` Parav Pandit
  2023-11-21 21:00                                                                             ` Michael S. Tsirkin
@ 2023-11-22  4:17                                                                             ` Jason Wang
  2023-11-22  4:34                                                                               ` Parav Pandit
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-22  4:17 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 22, 2023 at 12:29 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 10:47 AM
> >
> > On Fri, Nov 17, 2023 at 8:51 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: virtio-comment@lists.oasis-open.org
> > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Michael S.
> > > > Tsirkin
> > > > Sent: Friday, November 17, 2023 6:11 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 12:22:59PM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 5:03 PM
> > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > >
> > > > > > > Somehow the claim of shadow vq is great without sharing any
> > > > > > > performance
> > > > > > numbers is what I don't agree with.
> > > > > >
> > > > > > It's upstream in QEMU. Test it youself.
> > > > > >
> > > > > We did few minutes back.
> > > > > It results in a call trace.
> > > > > Vhost_vdpa_setup_vq_irq crashes on list corruption on net-next.
> > > >
> > > > Wrong list for this bug report.
> > > >
> > > > > We are stopping any shadow vq tests on unstable stuff.
> > > >
> > > > If you don't want to benchmark against alternatives how are you
> > > > going to prove your stuff is worth everyone's time?
> > >
> > > Comparing performance of the functional things count.
> > > You suggest shadow vq, frankly you should post the grand numbers of
> > shadow vq.
> >
> > We need an apple to apple comparison. Otherwise you may argue with that,
> > no?
> >
> When the requirements are met the comparison can be made of the solution.
> And I don’t see that the basic requirements are matching for two different use cases.
> So no point in discussing one OS specific implementation as reference point.

Shadow virtqueue is not OS-specific; it's a common method. If you
disagree, please explain why.

> Otherwise I will end up adding vfio link in the commit log in next version as you are asking similar things here and being non neutral to your ask.

When doing a benchmark, you need to describe your setup, no? So any
benchmark is setup-specific; there is nothing wrong with that.

It looks to me like you claim your method is better, but refuse to give
proof.

>
> Anyway, please bring the perf data whichever you want to compare in another forum. It is not the criteria anyway.

So how can you prove your method is the best one? You have been posting
the series for months, and so far I still don't see any rationale for
why you chose to go that way.

This is very odd, as we went through several methods one or two years
ago when discussing vDPA live migration.

>
> > >
> > > It is really not my role to report bug of unstable stuff and compare the perf
> > against.
> >
> > Qemu/KVM is highly relevant here no? And it's the way to develop the
> > community. The shadow vq code is handy.
> It is relevant for direct mapped device.

Let's focus on the functionality first, then discuss the use cases. If
you can't prove that your proposal functions properly, what's the point
of discussing the use cases?

> There is absolutely no point of converting virtio device to another virtualization layer and run again and get another virtio device.
> So for direct mapping use case shadow vq is not relevant.

It is needed because shadow virtqueue is the baseline. Most of the
issues don't exist in the case of shadow virtqueue.

We don't want to end up with a solution that

1) can't outperform shadow virtqueue
2) has more issues than shadow virtqueue

> For other use cases, please continue.
>
> >
> > Just an email to Qemu should be fine, we're not asking you to fix the bug.
> >
> > Btw, how do you define stable? E.g do you think the Linus tree is stable?
> >
> Basic test with iperf is not working. Crashing it.

As a kernel developer, dealing with crashes at any layer is pretty common, no?

Thanks


> All of this is complete unrelated discussion to this series to slow down the work.
> I don’t see any value.
> Michael asked to do the test, we did, it does not work. Functionally broken code has no comparison.
>
> > Thanks
> >
> > >
> > > We propose device context and provided the numbers you asked. Mostly
> > wont be able to go farther than this.
> > >
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  4:14                                                     ` [virtio-comment] " Jason Wang
@ 2023-11-22  4:19                                                       ` Parav Pandit
  2023-11-24  3:09                                                         ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-22  4:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 9:45 AM
> 
> On Wed, Nov 22, 2023 at 12:26 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 9:55 AM
> > >
> > > On Fri, Nov 17, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 11:51 PM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 05:29:49PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Thursday, November 16, 2023 10:56 PM
> > > > > > >
> > > > > > > On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Thursday, November 16, 2023 5:18 PM
> > > > > > > > >
> > > > > > > > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit
> wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S.
> > > > > > > > > > > Tsirkin
> > > wrote:
> > > > > > > > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
> > > > > > > > > > > > Pandit
> > > wrote:
> > > > > > > > > > > > > We should expose a limit of the device in the
> > > > > > > > > > > > > proposed
> > > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much
> range
> > > > > > > > > > > it can
> > > > > > > track.
> > > > > > > > > > > > > So that future provisioning framework can use it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I will cover this in v5 early next week.
> > > > > > > > > > > >
> > > > > > > > > > > > I do worry about how this can even work though. If
> > > > > > > > > > > > you want a generic device you do not get to
> > > > > > > > > > > > dictate how much memory
> > > > > VM has.
> > > > > > > > > > > >
> > > > > > > > > > > > Aren't we talking bit per page? With 1TByte of
> > > > > > > > > > > > memory to track
> > > > > > > > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > >
> > > > > > > > > > > Ugh. Actually of course:
> > > > > > > > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit
> > > > > > > > > > > -> 8Mbyte per VF
> > > > > > > > > > >
> > > > > > > > > > > 8Gbyte per *PF* with 1K VFs.
> > > > > > > > > > >
> > > > > > > > > > Device may not maintain as a bitmap.
> > > > > > > > >
> > > > > > > > > However you maintain it, there's 256Mega bit of information.
> > > > > > > > There may be other data structures that device may deploy
> > > > > > > > as for example
> > > > > > > hash or tree or something else.
> > > > > > >
> > > > > > > Point being?
> > > > > > The device may have some hashing accelerator or other
> > > > > > improvements that
> > > > > may perform better than bitmap as many queues in parallel
> > > > > attempt to update the shared database.
> > > > >
> > > > > Maybe, I didn't give this thought.
> > > > >
> > > > > My point was that to be able to keep all combinations of
> > > > > dirty/non dirty page for each 4k page in a 1TByte guest device
> > > > > needs 8MBytes of on-device memory per VF. As designed the query
> > > > > also has to report it for each VF accurately even if multiple VFs are
> accessing same guest.
> > > > Yes.
> > > >
> > > > >
> > > > > > >
> > > > > > > > And this is runtime memory only during the short live
> > > > > > > > migration period of
> > > > > > > 400msec or less.
> > > > > > > > It is not some _always_ resident memory.
> > >
> > > When developing the spec, we should not have any assumption for the
> > > implementation. For example, you can't just assume virtio is always
> > > emulated in the software in the DPU.
> > >
> > There is no such assumption.
> > It is supported on non DPU devices too.
> 
> You meant e.g a 8MB on-chip resource per VF is good to go?
>
It is a device implementation detail. Maybe it uses 8MB, maybe not.
And if you are going to compare that again with slow register-based memory, it is not an apples-to-apples comparison anyway.

A non-DPU device may also have such memory for data path acceleration.
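Purely as an illustration of a non-bitmap representation a device could choose, here is a minimal sketch of a hashed set of dirty PFNs; everything here is an assumption for the sake of example, not a description of any real device:

/*
 * Illustration only: one possible non-bitmap representation for dirty
 * page tracking, a tiny open-addressed hash set of dirty PFNs.
 * This sketches an assumed device-internal structure, not a real one.
 */
#include <stdint.h>
#include <string.h>

#define SLOTS  (1u << 16)          /* capacity; a real device would size this */
#define EMPTY  UINT64_MAX

struct dirty_set {
        uint64_t pfn[SLOTS];
};

static void dirty_set_init(struct dirty_set *s)
{
        memset(s->pfn, 0xff, sizeof(s->pfn));   /* mark every slot EMPTY */
}

/* Returns 0 on success, -1 if the set is full (caller must fall back,
 * e.g. by reporting the whole tracked range as dirty). */
static int dirty_set_add(struct dirty_set *s, uint64_t pfn)
{
        uint32_t h = (uint32_t)((pfn * 0x9E3779B97F4A7C15ULL) >> 48) & (SLOTS - 1);

        for (uint32_t i = 0; i < SLOTS; i++, h = (h + 1) & (SLOTS - 1)) {
                if (s->pfn[h] == pfn)
                        return 0;               /* already recorded */
                if (s->pfn[h] == EMPTY) {
                        s->pfn[h] = pfn;        /* record new dirty PFN */
                        return 0;
                }
        }
        return -1;                              /* full: overflow handling needed */
}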
 
> >
> > > How can you make sure you can converge in 400ms without having a
> > > interface for the driver to set the correct parameter like dirty rates?
> >
> > 400msec is also written anywhere as requirement if this is what you want to
> argue about.
> 
> No, the downtime needs to coordinate with the hypervisor, that is what I
> want to say. Unfortunately, I don't see any interface in this series.
> 
What do you mean by coordinated?
This series has a mechanism to eliminate the downtime on the source and destination sides during the pre-copy phase of device migration.

> > There is nothing prevents to extend the interface to define the SLA as
> additional commands in the future to improve the solution.
> >
> > There is no need to boil the ocean now. Once the base infrastructure is
> built, we will improve it further.
> > And proposed patches are reasonably well covered to our knowledge.
> 
> Well, it is not me but you that claims it can be done in 400ms. I'm wondering
> how and you told me it could be done in the future?
>
In our tests it is close to this number.
The discussion is about programming the SLA, and that can be an extension.

^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21 16:30                                                               ` Parav Pandit
@ 2023-11-22  4:19                                                                 ` Jason Wang
  2023-11-22  4:28                                                                   ` Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-22  4:19 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan, virtio-comment, cohuck,
	sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas, eperezma,
	Siwei Liu

On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 12:25 PM
> >
> > On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 7:31 PM
> > > > To: Parav Pandit <parav@nvidia.com>
> > > >
> > > > On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 6:02 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
> > > > > > > > > > > > > > > Lingshan
> > > > > > wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000,
> > > > > > > > > > > > > > >>> Parav Pandit
> > > > > > wrote:
> > > > > > > > > > > > > > >>>> We should expose a limit of the device in
> > > > > > > > > > > > > > >>>> the proposed
> > > > > > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much
> > > > range
> > > > > > > > > > > > > > it can
> > > > > > > > > > track.
> > > > > > > > > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > > > > > > >>> I do worry about how this can even work though.
> > > > > > > > > > > > > > >>> If you want a generic device you do not get
> > > > > > > > > > > > > > >>> to dictate how much memory VM
> > > > > > > > > > has.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> Aren't we talking bit per page? With 1TByte
> > > > > > > > > > > > > > >>> of memory to track
> > > > > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > > > > > > > >>> while at the same time fighting tooth and
> > > > > > > > > > > > > > >>> nail against adding single bit status
> > > > > > > > > > > > > > >>> registers because
> > > > scalability?
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> I have a feeling doing this completely
> > > > > > > > > > > > > > >>> theoretical like this is
> > > > > > > > > > problematic.
> > > > > > > > > > > > > > >>> Maybe you have it all laid out neatly in
> > > > > > > > > > > > > > >>> your head but I suspect not all of TC can
> > > > > > > > > > > > > > >>> picture it clearly enough based just on spec
> > > > > > > > > > text.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> We do sometimes ask for POC implementation
> > > > > > > > > > > > > > >>> in linux / qemu to demonstrate how things
> > > > > > > > > > > > > > >>> work before merging
> > > > > > code.
> > > > > > > > > > > > > > >>> We skipped this for admin things so far but
> > > > > > > > > > > > > > >>> I think it's a good idea to start doing it here.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> What makes me pause a bit before saying
> > > > > > > > > > > > > > >>> please do a PoC is all the opposition that
> > > > > > > > > > > > > > >>> seems to exist to even using admin commands
> > > > > > > > > > > > > > >>> in the 1st place. I think once we finally
> > > > > > > > > > > > > > >>> stop arguing about whether to use admin
> > > > > > > > > > > > > > >>> commands at all then a PoC will be needed
> > > > > > > > > > > > before merging.
> > > > > > > > > > > > > > >> We have POR productions that implemented the
> > > > > > > > > > > > > > >> approach in my
> > > > > > > > > > series.
> > > > > > > > > > > > > > >> They are multiple generations of productions
> > > > > > > > > > > > > > >> in market and running in customers data centers for
> > years.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Back to 2019 when we start working on vDPA,
> > > > > > > > > > > > > > >> we have sent some samples of production(e.g.,
> > > > > > > > > > > > > > >> Cascade
> > > > > > > > > > > > > > >> Glacier) and the datasheet, you can find live
> > > > > > > > > > > > > > >> migration facilities there, includes suspend,
> > > > > > > > > > > > > > >> vq state and other
> > > > > > features.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> And there is an reference in DPDK live
> > > > > > > > > > > > > > >> migration, I have provided this page
> > > > > > > > > > > > > > >> before:
> > > > > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/if
> > > > > > > > > > > > > > >> c.ht ml, it has been working for long long
> > > > > > > > > > > > > > >> time.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> So if we let the facts speak, if we want to
> > > > > > > > > > > > > > >> see if the proposal is proven to work, I
> > > > > > > > > > > > > > >> would
> > > > > > > > > > > > > > >> say: They are POR for years, customers
> > > > > > > > > > > > > > >> already deployed them for
> > > > > > > > > > years.
> > > > > > > > > > > > > > > And I guess what you are trying to say is that
> > > > > > > > > > > > > > > this patchset we are reviewing here should be
> > > > > > > > > > > > > > > help to the same standard and there should be
> > > > > > > > > > > > > > > a PoC? Sounds
> > > > reasonable.
> > > > > > > > > > > > > > Yes and the in-marketing productions are POR,
> > > > > > > > > > > > > > the series just improves the design, for
> > > > > > > > > > > > > > example, our series also use registers to track
> > > > > > > > > > > > > > vq state, but improvements than CG or BSC. So I
> > > > > > > > > > > > > > think they are proven
> > > > > > > > > > > > to work.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If you prefer to go the route of POR and
> > > > > > > > > > > > > production and proven documents
> > > > > > > > > > > > etc, there is ton of it of multiple types of
> > > > > > > > > > > > products I can dump here with open- source code and
> > > > > > > > > > > > documentation and
> > > > more.
> > > > > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Michael has requested some performance
> > > > > > > > > > > > > comparisons, not all are ready to
> > > > > > > > > > > > share yet.
> > > > > > > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > > > > > > >
> > > > > > > > > > > > > And all the vdpa dpdk you published does not have
> > > > > > > > > > > > > basic CVQ support when I
> > > > > > > > > > > > last looked at it.
> > > > > > > > > > > > > Do you know when was it added?
> > > > > > > > > > > >
> > > > > > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > > > > > The problem with CVQ generally, is that VDPA wants
> > > > > > > > > > > > to shadow CVQ it at all times because it wants to
> > > > > > > > > > > > decode and cache the content. But this problem has
> > > > > > > > > > > > nothing to do with dirty tracking even though it
> > > > > > > > > > > > also
> > > > > > > > > > mentions "shadow":
> > > > > > > > > > > > if device can report it's state then there's no need
> > > > > > > > > > > > to shadow
> > > > CVQ.
> > > > > > > > > > >
> > > > > > > > > > > For the performance numbers with the pre-copy and
> > > > > > > > > > > device context of
> > > > > > > > > > patches posted 1 to 5, the downtime reduction of the VM
> > > > > > > > > > is 3.71x with active traffic on 8 RQs at 100Gbps port speed.
> > > > > > > > > >
> > > > > > > > > > Sounds good can you please post a bit more detail?
> > > > > > > > > > which configs are you comparing what was the result on
> > > > > > > > > > each of
> > > > them.
> > > > > > > > >
> > > > > > > > > Common config: 8+8 tx and rx queues.
> > > > > > > > > Port speed: 100Gbps
> > > > > > > > > QEMU 8.1
> > > > > > > > > Libvirt 7.0
> > > > > > > > > GVM: Centos 7.4
> > > > > > > > > Device: virtio VF hardware device
> > > > > > > > >
> > > > > > > > > Config_1: virtio suspend/resume similar to what Lingshan
> > > > > > > > > has, largely vdpa stack
> > > > > > > > > Config_2: Device context method of admin commands
> > > > > > > >
> > > > > > > > OK that sounds good. The weird thing here is that you
> > > > > > > > measure
> > > > "downtime".
> > > > > > > > What exactly do you mean here?
> > > > > > > > I am guessing it's the time to retrieve on source and
> > > > > > > > re-program device state on destination? And this is 3.71x out of
> > how long?
> > > > > > > Yes. Downtime is the time during which the VM is not
> > > > > > > responding or receiving
> > > > > > packets, which involves reprogramming the device.
> > > > > > > 3.71x is relative time for this discussion.
> > > > > >
> > > > > > Oh interesting. So VM state movement including reprogramming the
> > > > > > CPU is dominated by reprogramming this single NIC, by a factor of
> > almost 4?
> > > > > Yes.
> > > >
> > > > Could you post some numbers too then?  I want to know whether that
> > > > would imply that VM boot is slowed down significantly too. If yes
> > > > that's another motivation for pci transport 2.0.
> > > It was 1.8 sec down to 480msec.
> >
> > Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
> >
> > Eugenio or Si-wei may share an exact number, but it should be several
> > hundreds of ms.
> >
> Shadow vq is not applicable at all as comparison point because there is no virtio specific qemu etc software involved here.

I don't get the point.

The shadow virtqueue is virtio specific for sure, and its core logic is
decoupled from the vDPA logic. If not, it's a bug and we need to fix it.

Thanks


>
> Anyways, the requested numbers are supplied for the device context based migration over admin vq proposed here.
>
>
> > But it seems the shadow virtqueue itself is not the major factor but the time
> > spent on programming vendor specific mappings for example.
> >
> > Thanks
> >
> > > The time didn't come from pci side or boot side.
> > >
> > > For pci side of things you would want to compare the pci vs non pci device
> > based VM boot time.
> > >
>
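On the bit-per-page sizing question raised earlier in this thread, the
arithmetic is straightforward (a minimal sketch assuming 4 KiB pages; the
1 TiB guest size is only an example):

    # One bit per tracked guest page.
    PAGE_SIZE = 4096                       # assumed 4 KiB pages
    tracked_bytes = 1 * 1024**4            # 1 TiB of guest memory (example)

    pages = tracked_bytes // PAGE_SIZE     # 268,435,456 pages
    bitmap_bytes = pages // 8              # one bit per page
    print(bitmap_bytes // 1024**2, "MiB")  # 32 MiB of bitmap per VF

So a plain bitmap for 1 TiB of guest memory is on the order of tens of
megabytes per VF, which is one reason the device should report a tracking
capacity limit (the WRITE_RECORD_CAP_QUERY limit discussed above) rather
than being assumed to cover an arbitrary range.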


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  4:19                                                                 ` Jason Wang
@ 2023-11-22  4:28                                                                   ` Parav Pandit
  2023-11-24  3:08                                                                     ` Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-22  4:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan, virtio-comment, cohuck,
	sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas, eperezma,
	Siwei Liu



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 9:50 AM
> 
> On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 12:25 PM
> > >
> > > On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 7:31 PM
> > > > > To: Parav Pandit <parav@nvidia.com>
> > > > >
> > > > > On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 6:02 PM
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > > > >
> > > > > > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit
> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav
> > > > > > > > > > > > > Pandit
> > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800,
> > > > > > > > > > > > > > > > Zhu, Lingshan
> > > > > > > wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM
> > > > > > > > > > > > > > > >>> +0000, Parav Pandit
> > > > > > > wrote:
> > > > > > > > > > > > > > > >>>> We should expose a limit of the device
> > > > > > > > > > > > > > > >>>> in the proposed
> > > > > > > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how
> > > > > > > > > > > > > > > much
> > > > > range
> > > > > > > > > > > > > > > it can
> > > > > > > > > > > track.
> > > > > > > > > > > > > > > >>>> So that future provisioning framework can use
> it.
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > > > > > > > >>> I do worry about how this can even work
> though.
> > > > > > > > > > > > > > > >>> If you want a generic device you do not
> > > > > > > > > > > > > > > >>> get to dictate how much memory VM
> > > > > > > > > > > has.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> Aren't we talking bit per page? With
> > > > > > > > > > > > > > > >>> 1TByte of memory to track
> > > > > > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> And you happily say "we'll address this in the
> future"
> > > > > > > > > > > > > > > >>> while at the same time fighting tooth
> > > > > > > > > > > > > > > >>> and nail against adding single bit
> > > > > > > > > > > > > > > >>> status registers because
> > > > > scalability?
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> I have a feeling doing this completely
> > > > > > > > > > > > > > > >>> theoretical like this is
> > > > > > > > > > > problematic.
> > > > > > > > > > > > > > > >>> Maybe you have it all laid out neatly in
> > > > > > > > > > > > > > > >>> your head but I suspect not all of TC
> > > > > > > > > > > > > > > >>> can picture it clearly enough based just
> > > > > > > > > > > > > > > >>> on spec
> > > > > > > > > > > text.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> We do sometimes ask for POC
> > > > > > > > > > > > > > > >>> implementation in linux / qemu to
> > > > > > > > > > > > > > > >>> demonstrate how things work before
> > > > > > > > > > > > > > > >>> merging
> > > > > > > code.
> > > > > > > > > > > > > > > >>> We skipped this for admin things so far
> > > > > > > > > > > > > > > >>> but I think it's a good idea to start doing it here.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> What makes me pause a bit before saying
> > > > > > > > > > > > > > > >>> please do a PoC is all the opposition
> > > > > > > > > > > > > > > >>> that seems to exist to even using admin
> > > > > > > > > > > > > > > >>> commands in the 1st place. I think once
> > > > > > > > > > > > > > > >>> we finally stop arguing about whether to
> > > > > > > > > > > > > > > >>> use admin commands at all then a PoC
> > > > > > > > > > > > > > > >>> will be needed
> > > > > > > > > > > > > before merging.
> > > > > > > > > > > > > > > >> We have POR productions that implemented
> > > > > > > > > > > > > > > >> the approach in my
> > > > > > > > > > > series.
> > > > > > > > > > > > > > > >> They are multiple generations of
> > > > > > > > > > > > > > > >> productions in market and running in
> > > > > > > > > > > > > > > >> customers data centers for
> > > years.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Back to 2019 when we start working on
> > > > > > > > > > > > > > > >> vDPA, we have sent some samples of
> > > > > > > > > > > > > > > >> production(e.g., Cascade
> > > > > > > > > > > > > > > >> Glacier) and the datasheet, you can find
> > > > > > > > > > > > > > > >> live migration facilities there, includes
> > > > > > > > > > > > > > > >> suspend, vq state and other
> > > > > > > features.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> And there is an reference in DPDK live
> > > > > > > > > > > > > > > >> migration, I have provided this page
> > > > > > > > > > > > > > > >> before:
> > > > > > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadev
> > > > > > > > > > > > > > > >> s/if c.ht ml, it has been working for
> > > > > > > > > > > > > > > >> long long time.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> So if we let the facts speak, if we want
> > > > > > > > > > > > > > > >> to see if the proposal is proven to work,
> > > > > > > > > > > > > > > >> I would
> > > > > > > > > > > > > > > >> say: They are POR for years, customers
> > > > > > > > > > > > > > > >> already deployed them for
> > > > > > > > > > > years.
> > > > > > > > > > > > > > > > And I guess what you are trying to say is
> > > > > > > > > > > > > > > > that this patchset we are reviewing here
> > > > > > > > > > > > > > > > should be help to the same standard and
> > > > > > > > > > > > > > > > there should be a PoC? Sounds
> > > > > reasonable.
> > > > > > > > > > > > > > > Yes and the in-marketing productions are
> > > > > > > > > > > > > > > POR, the series just improves the design,
> > > > > > > > > > > > > > > for example, our series also use registers
> > > > > > > > > > > > > > > to track vq state, but improvements than CG
> > > > > > > > > > > > > > > or BSC. So I think they are proven
> > > > > > > > > > > > > to work.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If you prefer to go the route of POR and
> > > > > > > > > > > > > > production and proven documents
> > > > > > > > > > > > > etc, there is ton of it of multiple types of
> > > > > > > > > > > > > products I can dump here with open- source code
> > > > > > > > > > > > > and documentation and
> > > > > more.
> > > > > > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Michael has requested some performance
> > > > > > > > > > > > > > comparisons, not all are ready to
> > > > > > > > > > > > > share yet.
> > > > > > > > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And all the vdpa dpdk you published does not
> > > > > > > > > > > > > > have basic CVQ support when I
> > > > > > > > > > > > > last looked at it.
> > > > > > > > > > > > > > Do you know when was it added?
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > > > > > > The problem with CVQ generally, is that VDPA
> > > > > > > > > > > > > wants to shadow CVQ it at all times because it
> > > > > > > > > > > > > wants to decode and cache the content. But this
> > > > > > > > > > > > > problem has nothing to do with dirty tracking
> > > > > > > > > > > > > even though it also
> > > > > > > > > > > mentions "shadow":
> > > > > > > > > > > > > if device can report it's state then there's no
> > > > > > > > > > > > > need to shadow
> > > > > CVQ.
> > > > > > > > > > > >
> > > > > > > > > > > > For the performance numbers with the pre-copy and
> > > > > > > > > > > > device context of
> > > > > > > > > > > patches posted 1 to 5, the downtime reduction of the
> > > > > > > > > > > VM is 3.71x with active traffic on 8 RQs at 100Gbps port
> speed.
> > > > > > > > > > >
> > > > > > > > > > > Sounds good can you please post a bit more detail?
> > > > > > > > > > > which configs are you comparing what was the result
> > > > > > > > > > > on each of
> > > > > them.
> > > > > > > > > >
> > > > > > > > > > Common config: 8+8 tx and rx queues.
> > > > > > > > > > Port speed: 100Gbps
> > > > > > > > > > QEMU 8.1
> > > > > > > > > > Libvirt 7.0
> > > > > > > > > > GVM: Centos 7.4
> > > > > > > > > > Device: virtio VF hardware device
> > > > > > > > > >
> > > > > > > > > > Config_1: virtio suspend/resume similar to what
> > > > > > > > > > Lingshan has, largely vdpa stack
> > > > > > > > > > Config_2: Device context method of admin commands
> > > > > > > > >
> > > > > > > > > OK that sounds good. The weird thing here is that you
> > > > > > > > > measure
> > > > > "downtime".
> > > > > > > > > What exactly do you mean here?
> > > > > > > > > I am guessing it's the time to retrieve on source and
> > > > > > > > > re-program device state on destination? And this is
> > > > > > > > > 3.71x out of
> > > how long?
> > > > > > > > Yes. Downtime is the time during which the VM is not
> > > > > > > > responding or receiving
> > > > > > > packets, which involves reprogramming the device.
> > > > > > > > 3.71x is relative time for this discussion.
> > > > > > >
> > > > > > > Oh interesting. So VM state movement including reprogramming
> > > > > > > the CPU is dominated by reprogramming this single NIC, by a
> > > > > > > factor of
> > > almost 4?
> > > > > > Yes.
> > > > >
> > > > > Could you post some numbers too then?  I want to know whether
> > > > > that would imply that VM boot is slowed down significantly too.
> > > > > If yes that's another motivation for pci transport 2.0.
> > > > It was 1.8 sec down to 480msec.
> > >
> > > Well, there's work ongoing to reduce the downtime of the shadow
> virtqueue.
> > >
> > > Eugenio or Si-wei may share an exact number, but it should be
> > > several hundreds of ms.
> > >
> > Shadow vq is not applicable at all as comparison point because there is no
> virtio specific qemu etc software involved here.
> 
> I don't get the point.
> 
> Shadow virtqueue is virtio specific for sure and the core logic is decoupled of
> the vDPA logic. If not, it's bug and we need to fix.
>
The base requirement is that the software does not mediate any virtio interfaces (config, cvq, data vqs).
Hence, for a direct-mapped device the shadow vq is not applicable at all, so there is no comparison point.
 
> Thanks
> 
> 
> >
> > Anyways, the requested numbers are supplied for the device context based
> migration over admin vq proposed here.
> >
> >
> > > But it seems the shadow virtqueue itself is not the major factor but
> > > the time spent on programming vendor specific mappings for example.
> > >
> > > Thanks
> > >
> > > > The time didn't come from pci side or boot side.
> > > >
> > > > For pci side of things you would want to compare the pci vs non
> > > > pci device
> > > based VM boot time.
> > > >
> >
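For reference, a quick consistency check of the downtime figures quoted
above (plain arithmetic, nothing beyond the quoted numbers):

    baseline_ms = 1800   # Config_1: suspend/resume style flow, ~1.8 sec
    admin_ms = 480       # Config_2: device context over admin commands
    print(f"{baseline_ms / admin_ms:.2f}x")   # 3.75x, in line with the ~3.71x quoted

so the 1.8 sec -> 480 msec numbers and the ~3.71x reduction describe the
same improvement, modulo rounding.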


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-21 16:31                                 ` [virtio-comment] " Parav Pandit
@ 2023-11-22  4:28                                   ` Jason Wang
  2023-11-22  6:41                                     ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-22  4:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 22, 2023 at 12:31 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 12:45 PM
> >
> > On Thu, Nov 16, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Thursday, November 16, 2023 9:54 AM
> > > >
> > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, November 13, 2023 9:07 AM
> > > > > >
> > > > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Tuesday, November 7, 2023 9:34 AM
> > > > > > > >
> > > > > > > > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Monday, November 6, 2023 12:04 PM
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav Pandit
> > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav Pandit
> > > > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > During a device migration flow (typically
> > > > > > > > > > > > > > > > > in a precopy phase of the live migration),
> > > > > > > > > > > > > > > > > a device may write to the guest memory.
> > > > > > > > > > > > > > > > > Some iommu/hypervisor may not be able to
> > > > > > > > > > > > > > > > > track these
> > > > > > > > > > > > written pages.
> > > > > > > > > > > > > > > > > These pages to be migrated from source to
> > > > > > > > > > > > > > > > > destination
> > > > > > > > hypervisor.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > A device which writes to these pages,
> > > > > > > > > > > > > > > > > provides the page address record of the to the owner
> > device.
> > > > > > > > > > > > > > > > > The owner device starts write recording
> > > > > > > > > > > > > > > > > for the device and queries all the page
> > > > > > > > > > > > > > > > > addresses written by the
> > > > > > device.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Fixes:
> > > > > > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/i
> > > > > > > > > > > > > > > > > ssue
> > > > > > > > > > > > > > > > > s/17
> > > > > > > > > > > > > > > > > 6
> > > > > > > > > > > > > > > > > Signed-off-by: Parav Pandit
> > > > > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > > > > > > > > Signed-off-by: Satananda Burla
> > > > > > > > > > > > > > > > > <sburla@marvell.com>
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > changelog:
> > > > > > > > > > > > > > > > > v1->v2:
> > > > > > > > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > >  admin-cmds-device-migration.tex | 15
> > > > > > > > > > > > > > > > > +++++++++++++++
> > > > > > > > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > b/admin-cmds-device-migration.tex index
> > > > > > > > > > > > > > > > > ed911e4..2e32f2c
> > > > > > > > > > > > > > > > > 100644
> > > > > > > > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > @@ -95,6 +95,21 @@ \subsubsection{Device
> > > > > > > > > > > > > > > > > Migration}\label{sec:Basic Facilities of a
> > > > > > > > > > > > > > > > > Virtio Device / The owner driver can
> > > > > > > > > > > > > > > > > discard any partially read or written
> > > > > > > > > > > > > > > > > device context when  any of the device
> > > > > > > > > > > > > > > > > migration flow
> > > > > > > > > > > > > > > > should be aborted.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +During the device migration flow, a
> > > > > > > > > > > > > > > > > +passthrough device may write data to the
> > > > > > > > > > > > > > > > > +guest virtual machine's memory, a source
> > > > > > > > > > > > > > > > > +hypervisor needs to keep track of these
> > > > > > > > > > > > > > > > > +written memory to migrate such memory to
> > > > > > > > > > > > > > > > > +destination
> > > > > > > > > > > > > > > > hypervisor.
> > > > > > > > > > > > > > > > > +Some systems may not be able to keep
> > > > > > > > > > > > > > > > > +track of such memory write addresses at hypervisor
> > level.
> > > > > > > > > > > > > > > > > +In such a scenario, a device records and
> > > > > > > > > > > > > > > > > +reports these written memory addresses to
> > > > > > > > > > > > > > > > > +the owner device. The owner driver
> > > > > > > > > > > > > > > > > +enables write recording for one or more
> > > > > > > > > > > > > > > > > +physical address ranges per device during
> > > > > > > > > > > > > > > > > +device
> > > > > > > > > > > > migration flow.
> > > > > > > > > > > > > > > > > +The owner driver periodically queries
> > > > > > > > > > > > > > > > > +these written physical address
> > > > > > > > > > > > > > records from the device.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I wonder how PA works in this case. Device
> > > > > > > > > > > > > > > > uses untranslated requests so it can only see IOVA.
> > > > > > > > > > > > > > > > We can't mandate
> > > > > > > > ATS anyhow.
> > > > > > > > > > > > > > > Michael suggested to keep the language uniform
> > > > > > > > > > > > > > > as PA as this is ultimately
> > > > > > > > > > > > > > what the guest driver is supplying during vq
> > > > > > > > > > > > > > creation and in posting buffers as physical address.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This seems to need some work. And, can you show
> > > > > > > > > > > > > > me how it can
> > > > > > > > work?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor
> > > > > > > > > > > > > > expected to do a bisection of the whole range?
> > > > > > > > > > > > > > 2) does the device need to reserve sufficient
> > > > > > > > > > > > > > internal resources for logging the dirty page and why
> > (not)?
> > > > > > > > > > > > > No when dirty page logging starts, only at that
> > > > > > > > > > > > > time, device will reserve
> > > > > > > > > > > > enough resources.
> > > > > > > > > > > >
> > > > > > > > > > > > GAW is 48bit, how large would it have then?
> > > > > > > > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > > > > > > > It is function of address ranges for the amount of
> > > > > > > > > > > guest memory regardless of
> > > > > > > > > > GAW.
> > > > > > > > > >
> > > > > > > > > > The problem is, e.g when vIOMMU is enabled, you can't
> > > > > > > > > > know which IOVA is actually used by guests. And even for
> > > > > > > > > > the case when vIOMMU is not enabled, the guest may have
> > several TBs.
> > > > > > > > > > Is it easy to reserve sufficient resources by the device itself?
> > > > > > > > > >
> > > > > > > > > When page tracking is enabled per device, it knows about
> > > > > > > > > the range and it can
> > > > > > > > reserve certain resource.
> > > > > > > >
> > > > > > > > I didn't see such an interface in this series. Anything I miss?
> > > > > > > >
> > > > > > > Yes, this patch and the next patch is covering the page
> > > > > > > tracking start,stop and
> > > > > > query commands.
> > > > > > > They are named as write recording commands.
> > > > > >
> > > > > > So I still don't see how the device can reserve sufficient resources?
> > > > > > Guests may map a very large area of memory to IOMMU (or when
> > > > > > vIOMMU is disabled, GPA is used). It would be several TBs, how
> > > > > > can the device reserve sufficient resources in this case?
> > > > > When the map is established, the ranges are supplied to the device
> > > > > to know
> > > > how much to reserve.
> > > > > If device does not have enough resource, it fails the command.
> > > > >
> > > > > One can advance it further to provision for the desired range..
> > > >
> > > > Well, I think I've asked whether or not a bisection is needed, and
> > > > you told me not ...
> > > >
> > > > But at least we need to document this in the proposal, no?
> > > >
> > > We should expose a limit of the device in the proposed
> > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > So that future provisioning framework can use it.
> > >
> > > I will cover this in v5 early next week.
> > >
> > > > > >
> > > > > > >
> > > > > > > > Btw, the IOVA is allocated by the guest actually, how can we
> > > > > > > > know the
> > > > > > range?
> > > > > > > > (or using the host range?)
> > > > > > > >
> > > > > > > Hypervisor would have mapping translation.
> > > > > >
> > > > > > That's really tricky and can only work in some cases:
> > > > > >
> > > > > > 1) It requires the hypervisor to traverse the guest I/O page
> > > > > > tables which could be very large range
> > > > > > 2) It requests the hypervisor to trap the modification of guest
> > > > > > I/O page tables and synchronize with the range changes, which is
> > > > > > inefficient and can only be done when we are doing shadow PTEs.
> > > > > > It won't work when the nesting translation could be offloaded to
> > > > > > the hardware
> > > > > > 3) It is racy with the guest modification of I/O page tables
> > > > > > which is explained in another thread
> > > > > Mapping changes with more hw mmu's is not a frequent event and
> > > > > IOTLB
> > > > flush is done using querying the dirty log for the smaller range.
> > > > >
> > > > > > 4) No aware of new features like PASID which has been explained
> > > > > > in another thread
> > > > > For all the pinned work with non sw based IOMMU, it is typically small
> > subset.
> > > > > PASID is guest controlled.
> > > >
> > > > Let's repeat my points:
> > > >
> > > > 1) vq1 use untranslated request with PASID1
> > > > 2) vq2 use untranslated request with PASID2
> > > >
> > > > Shouldn't we log PASID as well?
> > > >
> > > Possibly yes, either to request the tracking per PASID or to log the PASID.
> > > When in future PASID based VQ are supported, this part should be
> > extended.
> >
> > Who is going to do the extension? They are orthogonal features for sure.
> Whoever extends the VQ for PASID programming.
>
> I plan to have generic command for VQ creation over CVQ

Another unrelated issue.

> for the wider use cases we discussed.

CVQ might want a dedicated PASID.

> It can have PASID parameter in future when one wants to add it.
>
> >
> > >
> > > > And
> > > >
> > > > 1) vq1 is using translated request
> > > > 2) vq2 is using untranslated request
> > > >
> >
> > How about this?
> How did driver program the device for vq1 to translated request and vq2 to not.
> And for which use case?

Again, it is allowed by the PCI spec, no? You've explained yourself
that your design needs to obey PCI spec.

And if you want to ask for use cases, there are some handy ones:

- ATS
- When IOMMU_PLATFORM is not negotiated
- MSI

Let's make sure the function of your proposal is correct before
talking about any use cases.

>
> >
> > >
> > > > How could we differ?
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > Host should always have more resources than device, in
> > > > > > > > > > that sense there could be several methods that tries to
> > > > > > > > > > utilize host memory instead of the one in the device. I
> > > > > > > > > > think we've discussed this when going through the doc prepared
> > by Eugenio.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > What happens if we're trying to migrate more than 1 device?
> > > > > > > > > > > >
> > > > > > > > > > > That is perfectly fine.
> > > > > > > > > > > Each device is updating its log of pages it wrote.
> > > > > > > > > > > The hypervisor is collecting their sum.
> > > > > > > > > >
> > > > > > > > > > See above.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 3) DMA is part of the transport, it's natural to
> > > > > > > > > > > > > > do logging there, why duplicate efforts in the virtio layer?
> > > > > > > > > > > > > He he, you have funny comment.
> > > > > > > > > > > > > When an abstract facility is added to virtio you
> > > > > > > > > > > > > say to do in
> > > > > > transport.
> > > > > > > > > > > >
> > > > > > > > > > > > So it's not done in the general facility but tied to the admin
> > part.
> > > > > > > > > > > > And we all know dirty page tracking is a challenge
> > > > > > > > > > > > and Eugenio has a good summary of pros/cons. A
> > > > > > > > > > > > revisit of those docs make me think virtio is not
> > > > > > > > > > > > the good place for doing that for
> > > > > > may reasons:
> > > > > > > > > > > >
> > > > > > > > > > > > 1) as stated, platform will evolve to be able to
> > > > > > > > > > > > tracking dirty pages, actually, it has been
> > > > > > > > > > > > supported by a lot of major IOMMU vendors
> > > > > > > > > > >
> > > > > > > > > > > This is optional facility in virtio.
> > > > > > > > > > > Can you please point to the references? I don’t see it
> > > > > > > > > > > in the common Linux
> > > > > > > > > > kernel support for it.
> > > > > > > > > >
> > > > > > > > > > Note that when IOMMUFD is being proposed, dirty page
> > > > > > > > > > tracking is one of the major considerations.
> > > > > > > > > >
> > > > > > > > > > This is one recent proposal:
> > > > > > > > > >
> > > > > > > > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > > > > > > > >
> > > > > > > > > Sure, so if platform supports it. it can be used from the platform.
> > > > > > > > > If it does not, the device supplies it.
> > > > > > > > >
> > > > > > > > > > > Instead Linux kernel choose to extend to the devices.
> > > > > > > > > >
> > > > > > > > > > Well, as I stated, tracking dirty pages is challenging
> > > > > > > > > > if you want to do it on a device, and you can't simply
> > > > > > > > > > invent dirty page tracking for each type of the devices.
> > > > > > > > > >
> > > > > > > > > It is not invented.
> > > > > > > > > It is generic framework for all virtio device types as proposed here.
> > > > > > > > > Keep in mind, that it is optional already in v3 series.
> > > > > > > > >
> > > > > > > > > > > At least not seen to arrive this in any near term in
> > > > > > > > > > > start of
> > > > > > > > > > > 2024 which is
> > > > > > > > > > where users must use this.
> > > > > > > > > > >
> > > > > > > > > > > > 2) you can't assume virtio is the only device that
> > > > > > > > > > > > can be used by the guest, having dirty pages
> > > > > > > > > > > > tracking to be implemented in each type of device is
> > > > > > > > > > > > unrealistic
> > > > > > > > > > > Of course, there is no such assumption made. Where did
> > > > > > > > > > > you see a text that
> > > > > > > > > > made such assumption?
> > > > > > > > > >
> > > > > > > > > > So what happens if you have a guest with virtio and
> > > > > > > > > > other devices
> > > > > > assigned?
> > > > > > > > > >
> > > > > > > > > What happens? Each device type would do its own dirty page
> > tracking.
> > > > > > > > > And if all devices does not have support, hypervisor knows
> > > > > > > > > to fall back to
> > > > > > > > platform iommu or its own.
> > > > > > > > >
> > > > > > > > > > > Each virtio and non virtio devices who wants to report
> > > > > > > > > > > their dirty page report,
> > > > > > > > > > will do their way.
> > > > > > > > > > >
> > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > deprecated in the future for sure, as platform will
> > > > > > > > > > > > provide much rich features for logging e.g it can do
> > > > > > > > > > > > it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > > to compete with the features that will be provided
> > > > > > > > > > > > by the platform
> > > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > > virtio tc with timelines
> > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > >
> > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > Virtio needs to be built on top of transport or
> > > > > > > > > > platform. There's no need to duplicate
> > > > > > > > their job.
> > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > >
> > > > > > > > > I wanted to see a strong commitment for the cpu vendors to
> > > > > > > > > support dirty
> > > > > > > > page tracking.
> > > > > > > >
> > > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel,
> > > > > > > > AMD and ARM are all supporting that now.
> > > > > > > >
> > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > >
> > > > > > > > Let me quote from the above link:
> > > > > > > >
> > > > > > > > """
> > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > """
> > > > > > > >
> > > > > > > > > Without such platform commitment, virtio also skipping it
> > > > > > > > > would not
> > > > work.
> > > > > > > >
> > > > > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > > > > vtd, the hw feature has been there for years.
> > > > > > > >
> > > > > > > Vtd has a sticky D bit that requires synchronization with
> > > > > > > IOPTE page caches
> > > > > > when sw wants to clear it.
> > > > > >
> > > > > > This is by design.
> > > > > >
> > > > > > > Do you know if is it reliable when device does multiple
> > > > > > > writes, ie,
> > > > > > >
> > > > > > > a. iommu write D bit
> > > > > > > b. software read it
> > > > > > > c. sw synchronize cache
> > > > > > > d. iommu write D bit on next write by device
> > > > > >
> > > > > > What issue did you see here? But that's not even an excuse, if
> > > > > > there's a bug, let's report it to IOMMU vendors and let them fix it.
> > > > > > The thread I point to you is actually a good space.
> > > > > >
> > > > > So we cannot claim that it is there in the platform.
> > > >
> > > > I'm confused, the thread I point to you did the cache
> > > > synchronization which has been explained in the changelog, so what's the
> > issue?
> > > >
> > > If the ask is for IOMMU chip to fix something, we cannot claim that dirty
> > page tracking is available already in platform.
> >
> > Again, can you describe the issue? Why do you think the sticky part is an
> > issue? IOTLB needs to be sync with IO page tables, what's wrong with this?
> Nothing wrong with it.
> The text is not affirmative to say it works if the sw clears it.
>
> >
> > >
> > > > >
> > > > > > Again, the point is to let the correct role play.
> > > > > >
> > > > > How many more years should we block the virtio device migration
> > > > > when
> > > > platform do not have it?
> > > >
> > > > At least for VT-D, it has been used for years.
> > > Is this device written pages tracked by KVM for VT-d as dirty page log,
> > instead through vfio?
> >
> > I don't get this question.
> You said the VT-d has dirty page tracking for years so it must be used by the sw during device migration.

It's the best way if the platform has the support for that.

> And if that is there, how is these dirty pages of iommu are merged with the cpu side?
> Is this done by KVM for passthrough devices for vfio?

I don't see how it is related to the discussion here. IOMMU support is
sufficient as a start. If you require CPU support, virtio is clearly
the wrong forum.
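That said, the merge itself is mechanically simple: whichever tracker
reports the writes (device write records, IOMMU dirty bits, or CPU-side
logging), the hypervisor accumulates them into one per-guest dirty set that
drives the pre-copy loop. A minimal sketch (Python; the record format is
purely illustrative, not the layout defined by these patches):

    # Each source reports dirtied guest-physical addresses; the hypervisor
    # ORs them into a single set of dirty page numbers.
    PAGE_SIZE = 4096

    def merge_dirty(*sources):
        dirty = set()
        for addrs in sources:            # e.g. device log, IOMMU log, KVM log
            dirty.update(a // PAGE_SIZE for a in addrs)
        return dirty

    device_log = [0x1000, 0x2000, 0x2008]   # reported via write record commands
    kvm_log = [0x2000, 0x9000]              # reported by CPU-side tracking
    print(sorted(merge_dirty(device_log, kvm_log)))   # [1, 2, 9]

This is the "hypervisor is collecting their sum" point made earlier in the
thread; which of the trackers actually feeds the set is a per-platform
choice.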

>
> >
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > ARM SMMU based servers to be present with D bit tracking.
> > > > > > > It is still early to say platform is ready.
> > > > > >
> > > > > > This is not what I read from both the series I posted and the
> > > > > > spec, dirty bit has been supported several years ago at least for vtd.
> > > > > Supported, but spec listed it as sticky bit that may require special
> > handling.
> > > >
> > > > Please explain why this is "special handling". IOMMU has several
> > > > different layers of caching, by design, it can't just open a window for D bit.
> > > >
> > > > > May be it is working, but not all cpu platforms have it.
> > > >
> > > > I don't see the point. Migration is not supported for virito as well.
> > > >
> > > I don’t see a point either to discuss.
> > >
> > > I already acked that platform may have support as well, and not all platform
> > has it.
> > > So the device feeds the data and its platform's choice to enable/disable.
> >
> > I've pointed out sufficient issues and I don't want to repeat them.
> There does not seem to be any that is critical enough for non viommu case.

No, see above.

> Viommu needs to flush the iotlb anyway.

I've explained it in another thread.

>
> >
> > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > It is optional so whichever has the support it will be used.
> > > > > >
> > > > > > I can't see the point of this, it is already available. And
> > > > > > migration doesn't exist in virtio spec yet.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > >
> > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > Because users needs to use it now.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > If not, we are better off to offer this, and when/if
> > > > > > > > > > > platform support is, sure,
> > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > >
> > > > > > > > > > > > 4) if the platform support is missing, we can use
> > > > > > > > > > > > software or leverage transport for assistance like
> > > > > > > > > > > > PRI
> > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > than page fault rate
> > > > > > > > > > done by the cpu.
> > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > >
> > > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > > Do you have perf data for this?
> > > > > > > >
> > > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > > small program that dirty every page by a NIC.
> > > > > > > >
> > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > >
> > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > >
> > > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > > NIC), we can't satisfy the requirement of the downtime. Or
> > > > > > > > if you see the converge, you might get help from the auto
> > > > > > > > converge support by the hypervisors like KVM where it tries
> > > > > > > > to throttle the VCPU then you can't reach
> > > > > > the wire speed.
> > > > > > > >
> > > > > > > Once PRI is enabled, even without migration, there is basic perf issues.
> > > > > >
> > > > > > The context is not PRI here...
> > > > > >
> > > > > > It's about if you can stick to wire speed during live migration.
> > > > > > Based on the analysis so far, you can't achieve wirespeed and
> > > > > > downtime at
> > > > the same time.
> > > > > > That's why the hypervisor needs to throttle VCPU or devices.
> > > > > >
> > > > > So?
> > > > > Device also may throttle itself.
> > > >
> > > > That's perfectly fine. We are on the same page, no? It's wrong to
> > > > judge the dirty page tracking in the context of live migration by
> > > > measuring whether or not the device can work at wire speed.
> > > >
> > > > >
> > > > > > For PRI, it really depends on how you want to use it. E.g if you
> > > > > > don't want to pin a page, the performance is the price you must pay.
> > > > > PRI without pinning does not make sense for device to make large
> > > > > mapping
> > > > queries.
> > > >
> > > > That's also fine. Hypervisors can choose to enable and use PRI
> > > > depending on the different cases.
> > > >
> > > So PRI is not must for device migration.
> >
> > I never say it's a must.
> >
> > > Device migration must be able to work without PRI enabled, as simple as
> > that as first base line.
> >
> > My point is that, you need document
> >
> > 1) why you think dirty page is a must or not
> Explained in the patch already in commit log and in spec theory already.
>
> > 2) why did you choose one of a specific way instead of others
> >
> This is not part of the spec anyway. This is already discussed in mailing list here in community.

It helps the reviewers; it doesn't hurt to have a summary in the
changelog. Otherwise people may ask the same questions endlessly.

>
> > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > So it is unusable.
> > > > > > > > > >
> > > > > > > > > > It's not about mandating, it's about doing things in the
> > > > > > > > > > correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > You should try.
> > > > > > > >
> > > > > > > > Not my duty, I just want to make sure things are done in the
> > > > > > > > correct layer, and once it needs to be done in the virtio,
> > > > > > > > there's nothing obviously
> > > > > > wrong.
> > > > > > > >
> > > > > > > At present, it looks all platforms are not equally ready for page
> > tracking.
> > > > > >
> > > > > > That's not an excuse to let virtio support that.
> > > > > It is wrong attribution as excuse.
> > > > >
> > > > > > And we need also to figure out if virtio can do that easily.
> > > > > > I've pointed out sufficient issues, I'm pretty sure there would
> > > > > > be more as the platform evolves.
> > > > > >
> > > > > I am not sure if virtio feeds the log into the platform.
> > > >
> > > > I don't understand the meaning here.
> > > >
> > > I mistakenly merged two sentences.
> > >
> > > Virtio feeds the dirty page details to the hypervisor platform which collects
> > and merges the page record.
> > > So it is platform choice to use iommu based tracking or device based.
> > >
> > > > >
> > > > > > >
> > > > > > > > > In the current state, it is mandating.
> > > > > > > > > And if you think PRI is the only way,
> > > > > > > >
> > > > > > > > I don't, it's just an example where virtio can leverage from
> > > > > > > > either transport or platform. Or if it's the fault in virtio
> > > > > > > > that slows down the PRI, then it is something we can do.
> > > > > > > >
> > > > > > > Yea, it does not seem to be ready yet.
> > > > > > >
> > > > > > > > >  than you should propose that in the dirty page tracking
> > > > > > > > > series that you listed
> > > > > > > > above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > >
> > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > > >
> > > > > > > Both the platform and virtio work is ongoing.
> > > > > >
> > > > > > Why duplicate the work then?
> > > > > >
> > > > > Not all cpu platforms support as far as I know.
> > > >
> > > > Yes, but we all know the platform is working to support this.
> > > >
> > > > Supporting this on the device is hard.
> > > >
> > > This is optional, whichever device would like to implement it, will support it.
> > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > When one does something in transport, you say,
> > > > > > > > > > > > > this is transport specific, do
> > > > > > > > > > > > some generic.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Here the device is being tracked is virtio device.
> > > > > > > > > > > > > PCI-SIG has told already that PCIM interface is
> > > > > > > > > > > > > outside the scope of
> > > > > > it.
> > > > > > > > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > > > > > > > >
> > > > > > > > > > > > You will end up with a competition with the
> > > > > > > > > > > > platform/transport one that will fail.
> > > > > > > > > > > >
> > > > > > > > > > > I don’t see a reason. There is no competition.
> > > > > > > > > > > Platform always have a choice to not use device side
> > > > > > > > > > > page tracking when it is
> > > > > > > > > > supported.
> > > > > > > > > >
> > > > > > > > > > Platform provides a lot of other functionalities for dirty logging:
> > > > > > > > > > e.g per PASID, granular, etc. So you want to duplicate
> > > > > > > > > > them again in the virtio? If not, why choose this way?
> > > > > > > > > >
> > > > > > > > > It is optional for the platforms where platform do not have it.
> > > > > > > >
> > > > > > > > We are developing new virtio functionalities that are
> > > > > > > > targeted for future platforms. Otherwise we would end up
> > > > > > > > with a feature with a very narrow use case.
> > > > > > > In general I agree that platform is an option too.
> > > > > > > Hypervisor will be able to make the decision to use platform
> > > > > > > when available
> > > > > > and fallback to device method when platform does not have it.
> > > > > > >
> > > > > > > Future and to be equally usable in near term :)
> > > > > >
> > > > > > Please don't double standard again:
> > > > > >
> > > > > > When you are talking about TDISP, you want virtio to be designed
> > > > > > to fit for the future where the platform is ready in the future
> > > > > > When you are talking about dirty tracking, you want it to work
> > > > > > now even if
> > > > > >
> > > > > The proposal of transport VQ is anti-TDISP.
> > > >
> > > > It's nothing about transport VQ, it's about you're saying the adminq
> > > > based device context. There's a comment to point out that the
> > > > current TDISP spec forbids modifying device state when TVM is
> > > > attached. Then you told us the TDISP may evolve for that.
> > > So? That is not double standard.
> > > The proposal is based on main principle that it is not depending on
> > > hypervisor traping + emulating which is the baseline of TDISP
> > >
> > > >
> > > > > The proposal of dirty tracking is not anti-platform. It is
> > > > > optional like rest of the
> > > > platform.
> > > > >
> > > > > > 1) most of the platform is ready now
> > > > > Can you list a ARM server CPU in production that has it? (not in
> > > > > some pdf
> > > > spec).
> > > >
> > > > Then in the context of a dirty page, I've proved you dirty page
> > > > tracking has been supported by all major vendors.
> > > Major IP vendor != major cpu chip vendor.
> > > I don’t agree with the proof.
> >
> > So this will be an endless debate. Did I ever ask you about ETA or any product
> > for TDISP?
> >
> ETA for TDISP is not relevant.
> You claimed for _major_ vendor support based on nonphysical cpu, hence the disagreement.

How do you define "support"?

Dirty tracking has been written into the IOMMU manuals for Intel, AMD
and ARM for years. So you think it's not supported now? I've told you
it has been shipped by Intel at least, and then you ask me which ARM
vendor ships those vIOMMUs.

For TDISP live migration, PCI doesn't even have a draft, no? I never
asked which chip vendor ships the platform.

You want to support dirty page tracking in virtio and keep asking when
it is supported by all platform vendors.

You want to prove your proposal can work for TDISP and TDISP migration
but never explain when it would be supported by at least one vendor.

Let's have a unified standard please.

> And that is not the reality.
>
> > >
> > > I already acknowledged that I have seen internal test report for dirty tracking
> > with one cpu and nic.
> > >
> > > I just don’t see all cpus have support for it.
> > > Hence, this optional feature.
> >
> > Repeat myself again.
> >
> > If it can be done easily and efficiently in virtio, I agree. But I've pointed out
> > several issues where it is not answered.
>
> I have answered most of your questions.
>
> The definition of 'easy' is very subjective.

The reason I don't think it is easy is that I can already see several
issues that cannot be solved easily.

> At one point RSS was also not easy in some devices and IOMMU dirty page tracking was also not easy.

Yes, but we can offload the IOMMU part to the vendor. Virtio can't do
much here, especially for the part that duplicates functionality
already provided by the transport or platform.

>
> >
> > >
> > > > Where you refuse to use the standard you used in explaining adminq
> > > > for device context in TDISP.
> > > >
> > > > So I didn't ask you the ETA of the TDISP support for migration or
> > > > adminq, but you want me to give you the production information which is
> > pointless.
> > > Because you keep claiming that _all_ cpus in the world has support for
> > efficient dirty page tracking.
> > >
> > > > You
> > > > might need to ask ARM to get an answer, but a simple google told me
> > > > the effort to support dirty page tracking in SMMUv3 could go back to early
> > 2021.
> > > >
> > > To my knowledge ARM do not produce physical chips.
> > > Your proposal is to keep those ARM server vendors to not use virtio devices.
> >
> > This arbitrary conclusion makes no sense.
> >
> Your conclusion about "all" and "major" physical cpu vendor supporting dirty page tracking is equally arbitrary.
> So better to not argue on this.

See above.

Thanks


>
> > I know at least one cloud vendor has used a virtio based device for years on
> > ARM. And that vendor has posted patches to support dirty page tracking since
> > 2020.
> >
> > Thanks
> >
> > > Does not make sense to me.
> > >
> > > > https://lore.kernel.org/linux-iommu/56b001fa-b4fe-c595-dc5e-
> > > > f362d2f07a19@linux.intel.com/t/
> > > >
> > > > Why is it not merged? It's simply because we agree to do it in the
> > > > layer of IOMMUFD so it needs to wait.
> > > >
> > > > Thanks
> > > >
> > > >
> > > > >
> > > > > > 2) whether or not virtio can log dirty page correctly is still
> > > > > > suspicious
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > There is no double standard. The feature is optional which
> > > > > co-exists as
> > > > explained above.
> > >
>





^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  4:17                                                                             ` Jason Wang
@ 2023-11-22  4:34                                                                               ` Parav Pandit
  2023-11-24  3:15                                                                                 ` Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-22  4:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 9:48 AM
> 
> On Wed, Nov 22, 2023 at 12:29 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 10:47 AM
> > >
> > > On Fri, Nov 17, 2023 at 8:51 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: virtio-comment@lists.oasis-open.org
> > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Michael S.
> > > > > Tsirkin
> > > > > Sent: Friday, November 17, 2023 6:11 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 12:22:59PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 5:03 PM
> > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > >
> > > > > > > > Somehow the claim of shadow vq is great without sharing
> > > > > > > > any performance
> > > > > > > numbers is what I don't agree with.
> > > > > > >
> > > > > > > It's upstream in QEMU. Test it youself.
> > > > > > >
> > > > > > We did few minutes back.
> > > > > > It results in a call trace.
> > > > > > Vhost_vdpa_setup_vq_irq crashes on list corruption on net-next.
> > > > >
> > > > > Wrong list for this bug report.
> > > > >
> > > > > > We are stopping any shadow vq tests on unstable stuff.
> > > > >
> > > > > If you don't want to benchmark against alternatives how are you
> > > > > going to prove your stuff is worth everyone's time?
> > > >
> > > > Comparing performance of the functional things count.
> > > > You suggest shadow vq, frankly you should post the grand numbers
> > > > of
> > > shadow vq.
> > >
> > > We need an apple to apple comparison. Otherwise you may argue with
> > > that, no?
> > >
> > When the requirements are met the comparison can be made of the
> solution.
> > And I don’t see that the basic requirements are matching for two different
> use cases.
> > So no point in discussing one OS specific implementation as reference
> point.
> 
> Shadow virtqueue is not OS specific, it's a common method. If you disagree,
> please explain why.
>
As you claim the shadow virtqueue is generic and not dependent on the OS, how do I benchmark it on QNX today?

> > Otherwise I will end up adding vfio link in the commit log in next version as
> you are asking similar things here and being non neutral to your ask.
> 
> When doing a benchmark, you need to describe your setups, no? So any
> benchmark is setup specific, nothing wrong.
> 
> It looks to me you claim your method is better, but refuse to give proofs.
>
I gave the details to Michael in the email. Please refer to that.
 
> >
> > Anyway, please bring the perf data whichever you want to compare in
> another forum. It is not the criteria anyway.
> 
> So how can you prove your method is the best one? You have posted the
> series for months, and so far I still don't see any rationale about why you
> choose to go that way.
It is explained in the theory of operation.
You refuse to read it.

> 
> This is very odd as we've gone through several methods one or two years ago
> when discussing vDPA live migration.
> 
It does not matter, as this is not a vDPA forum.

> >
> > > >
> > > > It is really not my role to report bug of unstable stuff and
> > > > compare the perf
> > > against.
> > >
> > > Qemu/KVM is highly relevant here no? And it's the way to develop the
> > > community. The shadow vq code is handy.
> > It is relevant for direct mapped device.
> 
> Let's focus on the function then discuss the use cases. If you can't prove your
> proposal has a proper function, what's the point of discussing the use cases?
> 
The proper function is described.
You choose not to accept it, in favour of considering only the vDPA case.

> > There is absolutely no point of converting virtio device to another
> virtualization layer and run again and get another virtio device.
> > So for direct mapping use case shadow vq is not relevant.
> 
> It is needed because shadow virtqueue is the baseline. Most of the issues
> don't exist in the case of shadow virtqueue.
> 
I disagree.
For direct mapping there is no virtio-specific OS layer involved.
Hence a shadow-vq-specific implementation is not applicable.

> We don't want to end up with a solution that
> 
> 1) can't outperform shadow virtqueue
Disagree. There is no shadow vq in direct mapping, so there is no comparison to make.

> 2) have more issues than shadow virtqueue
>
There are none.
 
> > For other use cases, please continue.
> >
> > >
> > > Just an email to Qemu should be fine, we're not asking you to fix the bug.
> > >
> > > Btw, how do you define stable? E.g do you think the Linus tree is stable?
> > >
> > Basic test with iperf is not working. Crashing it.
> 
> As a kernel developer, dealing with crashing at any layer is pretty common.
> No?
> 
So kernel developers do not ask others to benchmark against crashing code.

> Thanks
> 
> 
> > All of this is complete unrelated discussion to this series to slow down the
> work.
> > I don’t see any value.
> > Michael asked to do the test, we did, it does not work. Functionally broken
> code has no comparison.
> >
> > > Thanks
> > >
> > > >
> > > > We propose device context and provided the numbers you asked.
> > > > Mostly
> > > wont be able to go farther than this.
> > > >
> >


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  2:31                                                               ` Si-Wei Liu
@ 2023-11-22  5:31                                                                 ` Jason Wang
  2023-11-23 13:19                                                                   ` Si-Wei Liu
  0 siblings, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-22  5:31 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Parav Pandit, Michael S. Tsirkin, Zhu, Lingshan, virtio-comment,
	cohuck, sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas,
	eperezma

On Wed, Nov 22, 2023 at 10:31 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
> (dropping my personal email abandoned for upstream discussion for now,
> please try to copy my corporate email address for more timely response)
>
> On 11/20/2023 10:55 PM, Jason Wang wrote:
> > On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
> >>
> >>> From: Michael S. Tsirkin <mst@redhat.com>
> >>> Sent: Friday, November 17, 2023 7:31 PM
> >>> To: Parav Pandit <parav@nvidia.com>
> >>>
> >>> On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> >>>>
> >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>> Sent: Friday, November 17, 2023 6:02 PM
> >>>>>
> >>>>> On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> >>>>>>
> >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>> Sent: Friday, November 17, 2023 5:35 PM
> >>>>>>> To: Parav Pandit <parav@nvidia.com>
> >>>>>>>
> >>>>>>> On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> >>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>> Sent: Friday, November 17, 2023 5:04 PM
> >>>>>>>>>
> >>>>>>>>> On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> >>>>>>>>>>
> >>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>> Sent: Friday, November 17, 2023 4:30 PM
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>> Sent: Friday, November 17, 2023 3:30 PM
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
> >>>>>>>>>>>>>> Lingshan
> >>>>> wrote:
> >>>>>>>>>>>>>>> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
> >>>>>>>>>>>>>>>> Pandit
> >>>>> wrote:
> >>>>>>>>>>>>>>>>> We should expose a limit of the device in the
> >>>>>>>>>>>>>>>>> proposed
> >>>>>>>>>>>>> WRITE_RECORD_CAP_QUERY command, that how much
> >>> range
> >>>>>>>>>>>>> it can
> >>>>>>>>> track.
> >>>>>>>>>>>>>>>>> So that future provisioning framework can use it.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I will cover this in v5 early next week.
> >>>>>>>>>>>>>>>> I do worry about how this can even work though.
> >>>>>>>>>>>>>>>> If you want a generic device you do not get to
> >>>>>>>>>>>>>>>> dictate how much memory VM
> >>>>>>>>> has.
> >>>>>>>>>>>>>>>> Aren't we talking bit per page? With 1TByte of
> >>>>>>>>>>>>>>>> memory to track
> >>>>>>>>>>>>>>>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> And you happily say "we'll address this in the future"
> >>>>>>>>>>>>>>>> while at the same time fighting tooth and nail
> >>>>>>>>>>>>>>>> against adding single bit status registers because
> >>> scalability?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have a feeling doing this completely
> >>>>>>>>>>>>>>>> theoretical like this is
> >>>>>>>>> problematic.
> >>>>>>>>>>>>>>>> Maybe you have it all laid out neatly in your
> >>>>>>>>>>>>>>>> head but I suspect not all of TC can picture it
> >>>>>>>>>>>>>>>> clearly enough based just on spec
> >>>>>>>>> text.
> >>>>>>>>>>>>>>>> We do sometimes ask for POC implementation in
> >>>>>>>>>>>>>>>> linux / qemu to demonstrate how things work
> >>>>>>>>>>>>>>>> before merging
> >>>>> code.
> >>>>>>>>>>>>>>>> We skipped this for admin things so far but I
> >>>>>>>>>>>>>>>> think it's a good idea to start doing it here.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> What makes me pause a bit before saying please
> >>>>>>>>>>>>>>>> do a PoC is all the opposition that seems to
> >>>>>>>>>>>>>>>> exist to even using admin commands in the 1st
> >>>>>>>>>>>>>>>> place. I think once we finally stop arguing
> >>>>>>>>>>>>>>>> about whether to use admin commands at all then
> >>>>>>>>>>>>>>>> a PoC will be needed
> >>>>>>>>>>> before merging.
> >>>>>>>>>>>>>>> We have POR productions that implemented the
> >>>>>>>>>>>>>>> approach in my
> >>>>>>>>> series.
> >>>>>>>>>>>>>>> They are multiple generations of productions in
> >>>>>>>>>>>>>>> market and running in customers data centers for years.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Back to 2019 when we start working on vDPA, we
> >>>>>>>>>>>>>>> have sent some samples of production(e.g.,
> >>>>>>>>>>>>>>> Cascade
> >>>>>>>>>>>>>>> Glacier) and the datasheet, you can find live
> >>>>>>>>>>>>>>> migration facilities there, includes suspend, vq
> >>>>>>>>>>>>>>> state and other
> >>>>> features.
> >>>>>>>>>>>>>>> And there is an reference in DPDK live migration,
> >>>>>>>>>>>>>>> I have provided this page
> >>>>>>>>>>>>>>> before:
> >>>>>>>>>>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.ht
> >>>>>>>>>>>>>>> ml, it has been working for long long time.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> So if we let the facts speak, if we want to see
> >>>>>>>>>>>>>>> if the proposal is proven to work, I would
> >>>>>>>>>>>>>>> say: They are POR for years, customers already
> >>>>>>>>>>>>>>> deployed them for
> >>>>>>>>> years.
> >>>>>>>>>>>>>> And I guess what you are trying to say is that
> >>>>>>>>>>>>>> this patchset we are reviewing here should be help
> >>>>>>>>>>>>>> to the same standard and there should be a PoC? Sounds
> >>> reasonable.
> >>>>>>>>>>>>> Yes and the in-marketing productions are POR, the
> >>>>>>>>>>>>> series just improves the design, for example, our
> >>>>>>>>>>>>> series also use registers to track vq state, but
> >>>>>>>>>>>>> improvements than CG or BSC. So I think they are
> >>>>>>>>>>>>> proven
> >>>>>>>>>>> to work.
> >>>>>>>>>>>> If you prefer to go the route of POR and production
> >>>>>>>>>>>> and proven documents
> >>>>>>>>>>> etc, there is ton of it of multiple types of products I
> >>>>>>>>>>> can dump here with open- source code and documentation and
> >>> more.
> >>>>>>>>>>>> Let me know what you would like to see.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Michael has requested some performance comparisons,
> >>>>>>>>>>>> not all are ready to
> >>>>>>>>>>> share yet.
> >>>>>>>>>>>> Some are present that I will share in coming weeks.
> >>>>>>>>>>>>
> >>>>>>>>>>>> And all the vdpa dpdk you published does not have
> >>>>>>>>>>>> basic CVQ support when I
> >>>>>>>>>>> last looked at it.
> >>>>>>>>>>>> Do you know when was it added?
> >>>>>>>>>>> It's good enough for PoC I think, CVQ or not.
> >>>>>>>>>>> The problem with CVQ generally, is that VDPA wants to
> >>>>>>>>>>> shadow CVQ it at all times because it wants to decode
> >>>>>>>>>>> and cache the content. But this problem has nothing to
> >>>>>>>>>>> do with dirty tracking even though it also
> >>>>>>>>> mentions "shadow":
> >>>>>>>>>>> if device can report it's state then there's no need to shadow
> >>> CVQ.
> >>>>>>>>>> For the performance numbers with the pre-copy and device
> >>>>>>>>>> context of
> >>>>>>>>> patches posted 1 to 5, the downtime reduction of the VM is
> >>>>>>>>> 3.71x with active traffic on 8 RQs at 100Gbps port speed.
> >>>>>>>>>
> >>>>>>>>> Sounds good can you please post a bit more detail?
> >>>>>>>>> which configs are you comparing what was the result on each of
> >>> them.
> >>>>>>>> Common config: 8+8 tx and rx queues.
> >>>>>>>> Port speed: 100Gbps
> >>>>>>>> QEMU 8.1
> >>>>>>>> Libvirt 7.0
> >>>>>>>> GVM: Centos 7.4
> >>>>>>>> Device: virtio VF hardware device
> >>>>>>>>
> >>>>>>>> Config_1: virtio suspend/resume similar to what Lingshan has,
> >>>>>>>> largely vdpa stack
> >>>>>>>> Config_2: Device context method of admin commands
> >>>>>>> OK that sounds good. The weird thing here is that you measure
> >>> "downtime".
> >>>>>>> What exactly do you mean here?
> >>>>>>> I am guessing it's the time to retrieve on source and re-program
> >>>>>>> device state on destination? And this is 3.71x out of how long?
> >>>>>> Yes. Downtime is the time during which the VM is not responding or
> >>>>>> receiving
> >>>>> packets, which involves reprogramming the device.
> >>>>>> 3.71x is relative time for this discussion.
> >>>>> Oh interesting. So VM state movement including reprogramming the CPU
> >>>>> is dominated by reprogramming this single NIC, by a factor of almost 4?
> >>>> Yes.
> >>> Could you post some numbers too then?  I want to know whether that would
> >>> imply that VM boot is slowed down significantly too. If yes that's another
> >>> motivation for pci transport 2.0.
> >> It was 1.8 sec down to 480msec.
> > Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
> >
> > Eugenio or Si-wei may share an exact number, but it should be several
> > hundreds of ms.
> That was mostly for device teardown time at the source, but there's
> also setup cost at the destination that needs to be counted.
> Several hundred of milliseconds would be the ultimate goal I would say
> (right now the numbers from Parav more or less reflects the status quo
> but there's ongoing work to make it further down), and I don't doubt
> several hundreds of ms is possible. But to be fair, on the other hand,
> shadow vq on real vdpa hardware device would need a lot of dedicated
> optimization work across all layers (including hardware or firmware) all
> over the places to achieve what a simple suspend-resume (save/load)
> interface can easily do with VFIO migration.

That's fine. Just to clarify, shadow virtqueue here doesn't mean it
can't save/load. We want to see how it is useful for dirty page
tracking, since tracking dirty pages by the device itself seems
problematic, at least from my point of view.

Shadow virtqueue can be used with a save/load model for device state
recovery for sure.
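
To make the dirty tracking part concrete, here is a minimal sketch of
why a shadow virtqueue gives the hypervisor dirty-page visibility
without device or IOMMU help. This is illustrative pseudo-C with made-up
names, not the actual QEMU vhost-shadow-virtqueue code:

    /*
     * Minimal illustration (not the QEMU code): every used element passes
     * through the hypervisor before the guest sees it, so the hypervisor
     * can log exactly which guest pages the device has just written.
     */
    #include <stdint.h>

    #define PAGE_SHIFT    12
    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    struct dirty_bitmap {
        unsigned long *bits;    /* one bit per guest page frame */
    };

    static void dirty_bitmap_set(struct dirty_bitmap *bm, uint64_t pfn)
    {
        bm->bits[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
    }

    /* Called when the shadow vq forwards a used element back to the guest. */
    static void svq_log_used(struct dirty_bitmap *bm,
                             uint64_t guest_paddr, uint32_t written_len)
    {
        uint64_t pfn, last;

        if (!written_len)
            return;
        pfn = guest_paddr >> PAGE_SHIFT;
        last = (guest_paddr + written_len - 1) >> PAGE_SHIFT;
        for (; pfn <= last; pfn++)
            dirty_bitmap_set(bm, pfn);
        /* ...then copy the used element into the guest-visible virtqueue. */
    }

The cost, as discussed above, is that every used element is mediated by
the hypervisor, which is exactly what the passthrough, device-side
recording approach tries to avoid.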

>
> > But it seems the shadow virtqueue itself is not the major factor but
> > the time spent on programming vendor specific mappings for example.
> Yep. The slowness on mapping part is mostly due to the artifact of
> software-based implementation. IMHO for live migration p.o.v it's better
> to not involve any mapping operation in the down time path at all.

Yes.

Thanks

>
> -Siwei
> >
> > Thanks
> >
> >> The time didn't come from pci side or boot side.
> >>
> >> For pci side of things you would want to compare the pci vs non pci device based VM boot time.
> >>
> >
> >
>
>
>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  4:28                                   ` [virtio-comment] " Jason Wang
@ 2023-11-22  6:41                                     ` Parav Pandit
  2023-11-24  3:06                                       ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Parav Pandit @ 2023-11-22  6:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu


> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 9:59 AM
> 
> On Wed, Nov 22, 2023 at 12:31 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 12:45 PM
> > >
> > > On Thu, Nov 16, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 9:54 AM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > > wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Monday, November 13, 2023 9:07 AM
> > > > > > >
> > > > > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Tuesday, November 7, 2023 9:34 AM
> > > > > > > > >
> > > > > > > > > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Monday, November 6, 2023 12:04 PM
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit
> > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav
> > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav
> > > > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > During a device migration flow
> > > > > > > > > > > > > > > > > > (typically in a precopy phase of the
> > > > > > > > > > > > > > > > > > live migration), a device may write to the guest
> memory.
> > > > > > > > > > > > > > > > > > Some iommu/hypervisor may not be able
> > > > > > > > > > > > > > > > > > to track these
> > > > > > > > > > > > > written pages.
> > > > > > > > > > > > > > > > > > These pages to be migrated from source
> > > > > > > > > > > > > > > > > > to destination
> > > > > > > > > hypervisor.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > A device which writes to these pages,
> > > > > > > > > > > > > > > > > > provides the page address record of
> > > > > > > > > > > > > > > > > > the to the owner
> > > device.
> > > > > > > > > > > > > > > > > > The owner device starts write
> > > > > > > > > > > > > > > > > > recording for the device and queries
> > > > > > > > > > > > > > > > > > all the page addresses written by the
> > > > > > > device.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Fixes:
> > > > > > > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-sp
> > > > > > > > > > > > > > > > > > ec/i
> > > > > > > > > > > > > > > > > > ssue
> > > > > > > > > > > > > > > > > > s/17
> > > > > > > > > > > > > > > > > > 6
> > > > > > > > > > > > > > > > > > Signed-off-by: Parav Pandit
> > > > > > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > > > > > > > > > Signed-off-by: Satananda Burla
> > > > > > > > > > > > > > > > > > <sburla@marvell.com>
> > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > changelog:
> > > > > > > > > > > > > > > > > > v1->v2:
> > > > > > > > > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > >  admin-cmds-device-migration.tex | 15
> > > > > > > > > > > > > > > > > > +++++++++++++++
> > > > > > > > > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > > b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > > index ed911e4..2e32f2c
> > > > > > > > > > > > > > > > > > 100644
> > > > > > > > > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > > @@ -95,6 +95,21 @@
> > > > > > > > > > > > > > > > > > \subsubsection{Device
> > > > > > > > > > > > > > > > > > Migration}\label{sec:Basic Facilities
> > > > > > > > > > > > > > > > > > of a Virtio Device / The owner driver
> > > > > > > > > > > > > > > > > > can discard any partially read or
> > > > > > > > > > > > > > > > > > written device context when  any of
> > > > > > > > > > > > > > > > > > the device migration flow
> > > > > > > > > > > > > > > > > should be aborted.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > +During the device migration flow, a
> > > > > > > > > > > > > > > > > > +passthrough device may write data to
> > > > > > > > > > > > > > > > > > +the guest virtual machine's memory, a
> > > > > > > > > > > > > > > > > > +source hypervisor needs to keep track
> > > > > > > > > > > > > > > > > > +of these written memory to migrate
> > > > > > > > > > > > > > > > > > +such memory to destination
> > > > > > > > > > > > > > > > > hypervisor.
> > > > > > > > > > > > > > > > > > +Some systems may not be able to keep
> > > > > > > > > > > > > > > > > > +track of such memory write addresses
> > > > > > > > > > > > > > > > > > +at hypervisor
> > > level.
> > > > > > > > > > > > > > > > > > +In such a scenario, a device records
> > > > > > > > > > > > > > > > > > +and reports these written memory
> > > > > > > > > > > > > > > > > > +addresses to the owner device. The
> > > > > > > > > > > > > > > > > > +owner driver enables write recording
> > > > > > > > > > > > > > > > > > +for one or more physical address
> > > > > > > > > > > > > > > > > > +ranges per device during device
> > > > > > > > > > > > > migration flow.
> > > > > > > > > > > > > > > > > > +The owner driver periodically queries
> > > > > > > > > > > > > > > > > > +these written physical address
> > > > > > > > > > > > > > > records from the device.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I wonder how PA works in this case.
> > > > > > > > > > > > > > > > > Device uses untranslated requests so it can only see
> IOVA.
> > > > > > > > > > > > > > > > > We can't mandate
> > > > > > > > > ATS anyhow.
> > > > > > > > > > > > > > > > Michael suggested to keep the language
> > > > > > > > > > > > > > > > uniform as PA as this is ultimately
> > > > > > > > > > > > > > > what the guest driver is supplying during vq
> > > > > > > > > > > > > > > creation and in posting buffers as physical address.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This seems to need some work. And, can you
> > > > > > > > > > > > > > > show me how it can
> > > > > > > > > work?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor
> > > > > > > > > > > > > > > expected to do a bisection of the whole range?
> > > > > > > > > > > > > > > 2) does the device need to reserve
> > > > > > > > > > > > > > > sufficient internal resources for logging
> > > > > > > > > > > > > > > the dirty page and why
> > > (not)?
> > > > > > > > > > > > > > No when dirty page logging starts, only at
> > > > > > > > > > > > > > that time, device will reserve
> > > > > > > > > > > > > enough resources.
> > > > > > > > > > > > >
> > > > > > > > > > > > > GAW is 48bit, how large would it have then?
> > > > > > > > > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > > > > > > > > It is function of address ranges for the amount of
> > > > > > > > > > > > guest memory regardless of
> > > > > > > > > > > GAW.
> > > > > > > > > > >
> > > > > > > > > > > The problem is, e.g when vIOMMU is enabled, you
> > > > > > > > > > > can't know which IOVA is actually used by guests.
> > > > > > > > > > > And even for the case when vIOMMU is not enabled,
> > > > > > > > > > > the guest may have
> > > several TBs.
> > > > > > > > > > > Is it easy to reserve sufficient resources by the device itself?
> > > > > > > > > > >
> > > > > > > > > > When page tracking is enabled per device, it knows
> > > > > > > > > > about the range and it can
> > > > > > > > > reserve certain resource.
> > > > > > > > >
> > > > > > > > > I didn't see such an interface in this series. Anything I miss?
> > > > > > > > >
> > > > > > > > Yes, this patch and the next patch is covering the page
> > > > > > > > tracking start,stop and
> > > > > > > query commands.
> > > > > > > > They are named as write recording commands.
> > > > > > >
> > > > > > > So I still don't see how the device can reserve sufficient resources?
> > > > > > > Guests may map a very large area of memory to IOMMU (or when
> > > > > > > vIOMMU is disabled, GPA is used). It would be several TBs,
> > > > > > > how can the device reserve sufficient resources in this case?
> > > > > > When the map is established, the ranges are supplied to the
> > > > > > device to know
> > > > > how much to reserve.
> > > > > > If device does not have enough resource, it fails the command.
> > > > > >
> > > > > > One can advance it further to provision for the desired range..
> > > > >
> > > > > Well, I think I've asked whether or not a bisection is needed,
> > > > > and you told me not ...
> > > > >
> > > > > But at least we need to document this in the proposal, no?
> > > > >
> > > > We should expose a limit of the device in the proposed
> > > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > > So that future provisioning framework can use it.
> > > >
> > > > I will cover this in v5 early next week.
> > > >
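To make that concrete, a purely illustrative sketch of what such a limit could look like; the field names below are invented for this sketch and are not the v5 layout:

    /* Hypothetical WRITE_RECORD_CAP_QUERY result, for illustration only.
     * Fields would be little-endian on the wire, as usual for virtio. */
    #include <stdint.h>

    struct virtio_admin_write_record_cap {
        uint64_t supported_track_bytes; /* total guest memory the device can track */
        uint32_t max_ranges;            /* number of physical address ranges */
        uint32_t reserved;
    };

A provisioning flow could then refuse to start migration, or shrink the tracked ranges, when the requested ranges exceed the reported capability.
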
> > > > > > >
> > > > > > > >
> > > > > > > > > Btw, the IOVA is allocated by the guest actually, how
> > > > > > > > > can we know the
> > > > > > > range?
> > > > > > > > > (or using the host range?)
> > > > > > > > >
> > > > > > > > Hypervisor would have mapping translation.
> > > > > > >
> > > > > > > That's really tricky and can only work in some cases:
> > > > > > >
> > > > > > > 1) It requires the hypervisor to traverse the guest I/O page
> > > > > > > tables which could be very large range
> > > > > > > 2) It requests the hypervisor to trap the modification of
> > > > > > > guest I/O page tables and synchronize with the range
> > > > > > > changes, which is inefficient and can only be done when we are
> doing shadow PTEs.
> > > > > > > It won't work when the nesting translation could be
> > > > > > > offloaded to the hardware
> > > > > > > 3) It is racy with the guest modification of I/O page tables
> > > > > > > which is explained in another thread
> > > > > > Mapping changes with more hw mmu's is not a frequent event and
> > > > > > IOTLB
> > > > > flush is done using querying the dirty log for the smaller range.
> > > > > >
> > > > > > > 4) No aware of new features like PASID which has been
> > > > > > > explained in another thread
> > > > > > For all the pinned work with non sw based IOMMU, it is
> > > > > > typically small
> > > subset.
> > > > > > PASID is guest controlled.
> > > > >
> > > > > Let's repeat my points:
> > > > >
> > > > > 1) vq1 use untranslated request with PASID1
> > > > > 2) vq2 use untranslated request with PASID2
> > > > >
> > > > > Shouldn't we log PASID as well?
> > > > >
> > > > Possibly yes, either to request the tracking per PASID or to log the PASID.
> > > > When in future PASID based VQ are supported, this part should be
> > > extended.
> > >
> > > Who is going to do the extension? They are orthogonal features for sure.
> > Whoever extends the VQ for PASID programming.
> >
> > I plan to have generic command for VQ creation over CVQ
> 
> Another unrelated issue.
I disagree.

> 
> > for the wider use cases we discussed.
> 
> CVQ might want a dedicated PASID.
Why? For a one-off queue like that, it may be an additional register, because this is still the bootstrap phase.
But using that as an argument to generalize for the rest of the queues is wrong.

> 
> > It can have PASID parameter in future when one wants to add it.
> >
> > >
> > > >
> > > > > And
> > > > >
> > > > > 1) vq1 is using translated request
> > > > > 2) vq2 is using untranslated request
> > > > >
> > >
> > > How about this?
> > How did driver program the device for vq1 to translated request and vq2 to
> not.
> > And for which use case?
> 
> Again, it is allowed by the PCI spec, no? You've explained yourself that your
> design needs to obey PCI spec.
> 
How did the guest driver program this in the device?

> And, if you want to ask. for use case, there are handy:
> 
> - ATS
> - When IOMMU_PLATFORM is not negotiated
> - MSI
> 
So why and how would the driver do that differently for the two vqs?

> Let's make sure the function of your proposal is correct before talking about
> any use cases.
This proposal has nothing to do with vqs.
It is simply that tracking does not involve PASID at the moment, and it can be added in the future.
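
To restate the intended use in code form, here is a hedged sketch of the owner-driver side of the flow. The helper names stand in for the proposed start/read/stop write-record admin commands and are not the actual command encodings; note PASID is deliberately absent from the parameters and could be added as an extra field later:

    /*
     * Illustrative skeleton only; the helpers below are placeholders for
     * the proposed write-record admin commands, not their real encodings.
     */
    #include <stdint.h>

    struct owner_dev;                         /* owner (PF) device handle */
    struct pa_range { uint64_t start; uint64_t len; };

    int write_record_start(struct owner_dev *o, uint32_t vf_id,
                           const struct pa_range *r, int nr);
    int write_record_read(struct owner_dev *o, uint32_t vf_id,
                          uint64_t *pa, int max);
    int write_record_stop(struct owner_dev *o, uint32_t vf_id);
    int precopy_converged(void);
    void hypervisor_mark_dirty(uint64_t pa);  /* merge into the common dirty log */

    int track_member_writes(struct owner_dev *o, uint32_t vf_id,
                            const struct pa_range *ranges, int nr_ranges)
    {
        uint64_t pa[256];
        int n;

        if (write_record_start(o, vf_id, ranges, nr_ranges))
            return -1;                        /* e.g. device lacks resources */

        while (!precopy_converged()) {
            n = write_record_read(o, vf_id, pa, 256);
            while (n-- > 0)
                hypervisor_mark_dirty(pa[n]);
        }
        return write_record_stop(o, vf_id);
    }

Whether the recording is per device or per PASID would then be an attribute of these commands, not of the virtqueues.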

> 
> >
> > >
> > > >
> > > > > How could we differ?
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Host should always have more resources than device,
> > > > > > > > > > > in that sense there could be several methods that
> > > > > > > > > > > tries to utilize host memory instead of the one in
> > > > > > > > > > > the device. I think we've discussed this when going
> > > > > > > > > > > through the doc prepared
> > > by Eugenio.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > What happens if we're trying to migrate more than 1
> device?
> > > > > > > > > > > > >
> > > > > > > > > > > > That is perfectly fine.
> > > > > > > > > > > > Each device is updating its log of pages it wrote.
> > > > > > > > > > > > The hypervisor is collecting their sum.
> > > > > > > > > > >
> > > > > > > > > > > See above.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3) DMA is part of the transport, it's
> > > > > > > > > > > > > > > natural to do logging there, why duplicate efforts in the
> virtio layer?
> > > > > > > > > > > > > > He he, you have funny comment.
> > > > > > > > > > > > > > When an abstract facility is added to virtio
> > > > > > > > > > > > > > you say to do in
> > > > > > > transport.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So it's not done in the general facility but
> > > > > > > > > > > > > tied to the admin
> > > part.
> > > > > > > > > > > > > And we all know dirty page tracking is a
> > > > > > > > > > > > > challenge and Eugenio has a good summary of
> > > > > > > > > > > > > pros/cons. A revisit of those docs make me think
> > > > > > > > > > > > > virtio is not the good place for doing that for
> > > > > > > may reasons:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1) as stated, platform will evolve to be able to
> > > > > > > > > > > > > tracking dirty pages, actually, it has been
> > > > > > > > > > > > > supported by a lot of major IOMMU vendors
> > > > > > > > > > > >
> > > > > > > > > > > > This is optional facility in virtio.
> > > > > > > > > > > > Can you please point to the references? I don’t
> > > > > > > > > > > > see it in the common Linux
> > > > > > > > > > > kernel support for it.
> > > > > > > > > > >
> > > > > > > > > > > Note that when IOMMUFD is being proposed, dirty page
> > > > > > > > > > > tracking is one of the major considerations.
> > > > > > > > > > >
> > > > > > > > > > > This is one recent proposal:
> > > > > > > > > > >
> > > > > > > > > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > > > > > > > > >
> > > > > > > > > > Sure, so if platform supports it. it can be used from the
> platform.
> > > > > > > > > > If it does not, the device supplies it.
> > > > > > > > > >
> > > > > > > > > > > > Instead Linux kernel choose to extend to the devices.
> > > > > > > > > > >
> > > > > > > > > > > Well, as I stated, tracking dirty pages is
> > > > > > > > > > > challenging if you want to do it on a device, and
> > > > > > > > > > > you can't simply invent dirty page tracking for each type of
> the devices.
> > > > > > > > > > >
> > > > > > > > > > It is not invented.
> > > > > > > > > > It is generic framework for all virtio device types as proposed
> here.
> > > > > > > > > > Keep in mind, that it is optional already in v3 series.
> > > > > > > > > >
> > > > > > > > > > > > At least not seen to arrive this in any near term
> > > > > > > > > > > > in start of
> > > > > > > > > > > > 2024 which is
> > > > > > > > > > > where users must use this.
> > > > > > > > > > > >
> > > > > > > > > > > > > 2) you can't assume virtio is the only device
> > > > > > > > > > > > > that can be used by the guest, having dirty
> > > > > > > > > > > > > pages tracking to be implemented in each type of
> > > > > > > > > > > > > device is unrealistic
> > > > > > > > > > > > Of course, there is no such assumption made. Where
> > > > > > > > > > > > did you see a text that
> > > > > > > > > > > made such assumption?
> > > > > > > > > > >
> > > > > > > > > > > So what happens if you have a guest with virtio and
> > > > > > > > > > > other devices
> > > > > > > assigned?
> > > > > > > > > > >
> > > > > > > > > > What happens? Each device type would do its own dirty
> > > > > > > > > > page
> > > tracking.
> > > > > > > > > > And if all devices does not have support, hypervisor
> > > > > > > > > > knows to fall back to
> > > > > > > > > platform iommu or its own.
> > > > > > > > > >
> > > > > > > > > > > > Each virtio and non virtio devices who wants to
> > > > > > > > > > > > report their dirty page report,
> > > > > > > > > > > will do their way.
> > > > > > > > > > > >
> > > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > > deprecated in the future for sure, as platform
> > > > > > > > > > > > > will provide much rich features for logging e.g
> > > > > > > > > > > > > it can do it per PASID etc, I don't see any
> > > > > > > > > > > > > reason virtio need to compete with the features
> > > > > > > > > > > > > that will be provided by the platform
> > > > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > > > virtio tc with timelines
> > > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > > >
> > > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > > Virtio needs to be built on top of transport or
> > > > > > > > > > > platform. There's no need to duplicate
> > > > > > > > > their job.
> > > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > > >
> > > > > > > > > > I wanted to see a strong commitment for the cpu
> > > > > > > > > > vendors to support dirty
> > > > > > > > > page tracking.
> > > > > > > > >
> > > > > > > > > The RFC of IOMMUFD support can go back to early 2022.
> > > > > > > > > Intel, AMD and ARM are all supporting that now.
> > > > > > > > >
> > > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > > >
> > > > > > > > > Let me quote from the above link:
> > > > > > > > >
> > > > > > > > > """
> > > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > > """
> > > > > > > > >
> > > > > > > > > > Without such platform commitment, virtio also skipping
> > > > > > > > > > it would not
> > > > > work.
> > > > > > > > >
> > > > > > > > > Is the above sufficient? I'm a little bit more familiar
> > > > > > > > > with vtd, the hw feature has been there for years.
> > > > > > > > >
> > > > > > > > Vtd has a sticky D bit that requires synchronization with
> > > > > > > > IOPTE page caches
> > > > > > > when sw wants to clear it.
> > > > > > >
> > > > > > > This is by design.
> > > > > > >
> > > > > > > > Do you know if is it reliable when device does multiple
> > > > > > > > writes, ie,
> > > > > > > >
> > > > > > > > a. iommu write D bit
> > > > > > > > b. software read it
> > > > > > > > c. sw synchronize cache
> > > > > > > > d. iommu write D bit on next write by device
> > > > > > >
> > > > > > > What issue did you see here? But that's not even an excuse,
> > > > > > > if there's a bug, let's report it to IOMMU vendors and let them fix it.
> > > > > > > The thread I point to you is actually a good space.
> > > > > > >
> > > > > > So we cannot claim that it is there in the platform.
> > > > >
> > > > > I'm confused, the thread I point to you did the cache
> > > > > synchronization which has been explained in the changelog, so
> > > > > what's the
> > > issue?
> > > > >
> > > > If the ask is for IOMMU chip to fix something, we cannot claim
> > > > that dirty
> > > page tracking is available already in platform.
> > >
> > > Again, can you describe the issue? Why do you think the sticky part
> > > is an issue? IOTLB needs to be sync with IO page tables, what's wrong with
> this?
> > Nothing wrong with it.
> > The text is not affirmative to say it works if the sw clears it.
> >
> > >
> > > >
> > > > > >
> > > > > > > Again, the point is to let the correct role play.
> > > > > > >
> > > > > > How many more years should we block the virtio device
> > > > > > migration when
> > > > > platform do not have it?
> > > > >
> > > > > At least for VT-D, it has been used for years.
> > > > Is this device written pages tracked by KVM for VT-d as dirty page
> > > > log,
> > > instead through vfio?
> > >
> > > I don't get this question.
> > You said the VT-d has dirty page tracking for years so it must be used by the
> sw during device migration.
> 
> It's the best way if the platform has the support for that.
> 
> > And if that is there, how is these dirty pages of iommu are merged with the
> cpu side?
> > Is this done by KVM for passthrough devices for vfio?
> 
> I don't see how it is related to the discussion here. IOMMU support is
> sufficient as a start. If you requires CPU support, virtio is clearly the wrong
> forum.
You made the point that VT-d dirty tracking has been in use for years.
I am asking how the kernel consumes it for passthrough devices such as vfio.

> 
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > ARM SMMU based servers to be present with D bit tracking.
> > > > > > > > It is still early to say platform is ready.
> > > > > > >
> > > > > > > This is not what I read from both the series I posted and
> > > > > > > the spec, dirty bit has been supported several years ago at least for
> vtd.
> > > > > > Supported, but spec listed it as sticky bit that may require
> > > > > > special
> > > handling.
> > > > >
> > > > > Please explain why this is "special handling". IOMMU has several
> > > > > different layers of caching, by design, it can't just open a window for D
> bit.
> > > > >
> > > > > > May be it is working, but not all cpu platforms have it.
> > > > >
> > > > > I don't see the point. Migration is not supported for virito as well.
> > > > >
> > > > I don’t see a point either to discuss.
> > > >
> > > > I already acked that platform may have support as well, and not
> > > > all platform
> > > has it.
> > > > So the device feeds the data and its platform's choice to enable/disable.
> > >
> > > I've pointed out sufficient issues and I don't want to repeat them.
> > There does not seem to be any that is critical enough for non viommu case.
> 
> No, see above.
> 
In the tests without a vIOMMU, the unmap range aligns with the dirty tracking range.

> > Viommu needs to flush the iotlb anyway.
> 
> I've explained it in another thread.
> 
> >
> > >
> > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > It is optional so whichever has the support it will be used.
> > > > > > >
> > > > > > > I can't see the point of this, it is already available. And
> > > > > > > migration doesn't exist in virtio spec yet.
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > >
> > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > Because users needs to use it now.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > If not, we are better off to offer this, and
> > > > > > > > > > > > when/if platform support is, sure,
> > > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > > >
> > > > > > > > > > > > > 4) if the platform support is missing, we can
> > > > > > > > > > > > > use software or leverage transport for
> > > > > > > > > > > > > assistance like PRI
> > > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > > than page fault rate
> > > > > > > > > > > done by the cpu.
> > > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > >
> > > > > > > > > > > If you stick to the wire speed during migration, it can
> converge.
> > > > > > > > > > Do you have perf data for this?
> > > > > > > > >
> > > > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > > > small program that dirty every page by a NIC.
> > > > > > > > >
> > > > > > > > > > In the internal tests we don’t see this happening.
> > > > > > > > >
> > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > >
> > > > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > > > NIC), we can't satisfy the requirement of the downtime.
> > > > > > > > > Or if you see the converge, you might get help from the
> > > > > > > > > auto converge support by the hypervisors like KVM where
> > > > > > > > > it tries to throttle the VCPU then you can't reach
> > > > > > > the wire speed.
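
As a purely illustrative back-of-the-envelope on the formula quoted above, with assumed numbers rather than measurements from this series:

    /* Back-of-the-envelope only; all inputs are assumptions for illustration. */
    #include <stdio.h>

    int main(void)
    {
        const double page_size = 4096.0;               /* bytes */
        const double link_bytes_per_s = 100e9 / 8;     /* 100 Gbps migration link */
        const double dirty_pages_per_s =
            link_bytes_per_s / page_size;              /* NIC dirtying at line rate */

        double downtime = dirty_pages_per_s * page_size / link_bytes_per_s;
        printf("%.2f s of copying per second of dirtying\n", downtime);
        /* Prints 1.00: the dirty rate matches the copy rate, so pre-copy never
         * converges unless the VCPU or the device is throttled. */
        return 0;
    }
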
> > > > > > > > >
> > > > > > > > Once PRI is enabled, even without migration, there is basic perf
> issues.
> > > > > > >
> > > > > > > The context is not PRI here...
> > > > > > >
> > > > > > > It's about if you can stick to wire speed during live migration.
> > > > > > > Based on the analysis so far, you can't achieve wirespeed
> > > > > > > and downtime at
> > > > > the same time.
> > > > > > > That's why the hypervisor needs to throttle VCPU or devices.
> > > > > > >
> > > > > > So?
> > > > > > Device also may throttle itself.
> > > > >
> > > > > That's perfectly fine. We are on the same page, no? It's wrong
> > > > > to judge the dirty page tracking in the context of live
> > > > > migration by measuring whether or not the device can work at wire
> speed.
> > > > >
> > > > > >
> > > > > > > For PRI, it really depends on how you want to use it. E.g if
> > > > > > > you don't want to pin a page, the performance is the price you must
> pay.
> > > > > > PRI without pinning does not make sense for device to make
> > > > > > large mapping
> > > > > queries.
> > > > >
> > > > > That's also fine. Hypervisors can choose to enable and use PRI
> > > > > depending on the different cases.
> > > > >
> > > > So PRI is not must for device migration.
> > >
> > > I never say it's a must.
> > >
> > > > Device migration must be able to work without PRI enabled, as
> > > > simple as
> > > that as first base line.
> > >
> > > My point is that, you need document
> > >
> > > 1) why you think dirty page is a must or not
> > Explained in the patch already in commit log and in spec theory already.
> >
> > > 2) why did you choose one of a specific way instead of others
> > >
> > This is not part of the spec anyway. This is already discussed in mailing list
> here in community.
> 
> It helps the reviewers, it doesn't harm to have a summary in the changelog. Or
> people may ask the same questions endlessly.
> 
At least the current reviewers who took part in the discussion should stop asking endlessly. :)

> >
> > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > So it is unusable.
> > > > > > > > > > >
> > > > > > > > > > > It's not about mandating, it's about doing things in
> > > > > > > > > > > the correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > > You should try.
> > > > > > > > >
> > > > > > > > > Not my duty, I just want to make sure things are done in
> > > > > > > > > the correct layer, and once it needs to be done in the
> > > > > > > > > virtio, there's nothing obviously
> > > > > > > wrong.
> > > > > > > > >
> > > > > > > > At present, it looks all platforms are not equally ready
> > > > > > > > for page
> > > tracking.
> > > > > > >
> > > > > > > That's not an excuse to let virtio support that.
> > > > > > It is wrong attribution as excuse.
> > > > > >
> > > > > > > And we need also to figure out if virtio can do that easily.
> > > > > > > I've pointed out sufficient issues, I'm pretty sure there
> > > > > > > would be more as the platform evolves.
> > > > > > >
> > > > > > I am not sure if virtio feeds the log into the platform.
> > > > >
> > > > > I don't understand the meaning here.
> > > > >
> > > > I mistakenly merged two sentences.
> > > >
> > > > Virtio feeds the dirty page details to the hypervisor platform
> > > > which collects
> > > and merges the page record.
> > > > So it is platform choice to use iommu based tracking or device based.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > In the current state, it is mandating.
> > > > > > > > > > And if you think PRI is the only way,
> > > > > > > > >
> > > > > > > > > I don't, it's just an example where virtio can leverage
> > > > > > > > > from either transport or platform. Or if it's the fault
> > > > > > > > > in virtio that slows down the PRI, then it is something we can do.
> > > > > > > > >
> > > > > > > > Yea, it does not seem to be ready yet.
> > > > > > > >
> > > > > > > > > >  than you should propose that in the dirty page
> > > > > > > > > > tracking series that you listed
> > > > > > > > > above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > > >
> > > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > > > >
> > > > > > > > Both the platform and virtio work is ongoing.
> > > > > > >
> > > > > > > Why duplicate the work then?
> > > > > > >
> > > > > > Not all cpu platforms support as far as I know.
> > > > >
> > > > > Yes, but we all know the platform is working to support this.
> > > > >
> > > > > Supporting this on the device is hard.
> > > > >
> > > > This is optional, whichever device would like to implement it, will support
> it.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > When one does something in transport, you say,
> > > > > > > > > > > > > > this is transport specific, do
> > > > > > > > > > > > > some generic.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Here the device is being tracked is virtio device.
> > > > > > > > > > > > > > PCI-SIG has told already that PCIM interface
> > > > > > > > > > > > > > is outside the scope of
> > > > > > > it.
> > > > > > > > > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > > > > > > > > >
> > > > > > > > > > > > > You will end up with a competition with the
> > > > > > > > > > > > > platform/transport one that will fail.
> > > > > > > > > > > > >
> > > > > > > > > > > > I don’t see a reason. There is no competition.
> > > > > > > > > > > > Platform always have a choice to not use device
> > > > > > > > > > > > side page tracking when it is
> > > > > > > > > > > supported.
> > > > > > > > > > >
> > > > > > > > > > > Platform provides a lot of other functionalities for dirty
> logging:
> > > > > > > > > > > e.g per PASID, granular, etc. So you want to
> > > > > > > > > > > duplicate them again in the virtio? If not, why choose this
> way?
> > > > > > > > > > >
> > > > > > > > > > It is optional for the platforms where platform do not have it.
> > > > > > > > >
> > > > > > > > > We are developing new virtio functionalities that are
> > > > > > > > > targeted for future platforms. Otherwise we would end up
> > > > > > > > > with a feature with a very narrow use case.
> > > > > > > > In general I agree that the platform is an option too.
> > > > > > > > The hypervisor will be able to make the decision to use the
> > > > > > > > platform when available and fall back to the device method
> > > > > > > > when the platform does not have it.
> > > > > > > >
> > > > > > > > Future and to be equally usable in near term :)
> > > > > > >
> > > > > > > Please don't apply a double standard again:
> > > > > > >
> > > > > > > When you are talking about TDISP, you want virtio to be designed
> > > > > > > to fit the future where the platform is ready. When you are
> > > > > > > talking about dirty tracking, you want it to work now even if
> > > > > > >
> > > > > > The proposal of transport VQ is anti-TDISP.
> > > > >
> > > > > It's not about transport VQ, it's about what you're saying about
> > > > > the adminq based device context. There's a comment pointing out that
> > > > > the current TDISP spec forbids modifying device state when a TVM
> > > > > is attached. Then you told us the TDISP spec may evolve for that.
> > > > So? That is not a double standard.
> > > > The proposal is based on the main principle that it does not depend
> > > > on hypervisor trapping + emulating, which is the baseline of TDISP.
> > > >
> > > > >
> > > > > > The proposal of dirty tracking is not anti-platform. It is
> > > > > > optional like rest of the
> > > > > platform.
> > > > > >
> > > > > > > 1) most of the platform is ready now
> > > > > > Can you list an ARM server CPU in production that has it? (not
> > > > > > in some PDF spec).
> > > > >
> > > > > Then in the context of dirty pages, I've proved to you that dirty
> > > > > page tracking has been supported by all major vendors.
> > > > Major IP vendor != major cpu chip vendor.
> > > > I don’t agree with the proof.
> > >
> > > So this will be an endless debate. Did I ever ask you about ETA or
> > > any product for TDISP?
> > >
> > ETA for TDISP is not relevant.
> > You claimed _major_ vendor support based on a non-physical CPU (IP)
> > vendor, hence the disagreement.
> 
> How did you define "support"?
> 
You claimed that it is supported, so I think it is on you to define "support". :)

> Dirty tracking has been written into the IOMMU manuals for Intel, AMD and
> ARM for years. So you think it's not supported now? I've told you it has been
> shipped by Intel at least, and then you ask me which ARM vendor ships those
> vIOMMUs.
> 
I wish the date a feature appears in a spec manual were the same as the date it is available in a cloud operator's data center servers.

> For TDISP live migration, PCI doesn't even have a draft, no? I never asked
> which chip vendor ships the platform.
> 

> You want to support dirty page tracking in virtio and keep asking when it is
> supported by all platform vendors.
Because you claim that all physical CPU vendors support it, without listing who 'all' and 'major' are.

> 
> You want to prove your proposal can work for TDISP and TDISP migration but
> never explain when it would be supported by at least one vendor.
> 
Part of the spec work is done keeping s
> Let's have a unified standard please.
The standard is unified.
The baseline tenet of the proposal is to not put any migration interface on the TDISP device itself that would need to be accessed by some other entity.

> 
> > And that is not the reality.
> >
> > > >
> > > > I already acknowledged that I have seen an internal test report for
> > > > dirty tracking with one CPU and NIC.
> > > >
> > > > I just don't see that all CPUs have support for it.
> > > > Hence, this is an optional feature.
> > >
> > > Repeat myself again.
> > >
> > > If it can be done easily and efficiently in virtio, I agree. But
> > > I've pointed out several issues where it is not answered.
> >
> > I have answered most of your questions.
> >
> > The definition of 'easy' is very subjective.
> 
> The reason why I don't think it is easy is because I can easily see several issues
> that can't be solved easily.
> 
> > At one point RSS was also not easy in some devices, and IOMMU dirty page
> > tracking was also not easy.
> 
> Yes, but we can offload the IOMMU part to the vendor. Virtio can't do
> anything, especially for the part that duplicates the function provided by
> the transport or platform.
And when the platform does not provide it, the virtio device can.

> 
> >
> > >
> > > >
> > > > > Where you refuse to use the standard you used in explaining
> > > > > adminq for device context in TDISP.
> > > > >
> > > > > So I didn't ask you the ETA of the TDISP support for migration
> > > > > or adminq, but you want me to give you the production
> > > > > information which is
> > > pointless.
> > > > Because you keep claiming that _all_ CPUs in the world have support
> > > > for efficient dirty page tracking.
> > > >
> > > > > You
> > > > > might need to ask ARM to get an answer, but a simple google told
> > > > > me the effort to support dirty page tracking in SMMUv3 could go
> > > > > back to early
> > > 2021.
> > > > >
> > > > To my knowledge ARM does not produce physical chips.
> > > > Your proposal would keep those ARM server vendors from using virtio
> > > > devices.
> > >
> > > This arbitrary conclusion makes no sense.
> > >
> > Your conclusion about "all" and "major" physical CPU vendors supporting
> > dirty page tracking is equally arbitrary.
> > So better not to argue about this.
> 
> See above.
> 
> Thanks
> 
> 
> >
> > > I know at least one cloud vendor has used a virtio based device for
> > > years on ARM. And that vendor has posted patches to support dirty
> > > page tracking since 2020.
> > >
> > > Thanks
> > >
> > > > Does not make sense to me.
> > > >
> > > > > https://lore.kernel.org/linux-iommu/56b001fa-b4fe-c595-dc5e-
> > > > > f362d2f07a19@linux.intel.com/t/
> > > > >
> > > > > Why is it not merged? It's simply because we agree to do it in
> > > > > the layer of IOMMUFD so it needs to wait.
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > >
> > > > > > > 2) whether or not virtio can log dirty page correctly is
> > > > > > > still suspicious
> > > > > > >
> > > > > > > Thanks
> > > > > >
> > > > > > There is no double standard. The feature is optional and
> > > > > > co-exists, as explained above.
> > > >
> >


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  3:46                                                                               ` Parav Pandit
@ 2023-11-22  7:44                                                                                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-22  7:44 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 22, 2023 at 03:46:41AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, November 22, 2023 2:31 AM
> > 
> > On Tue, Nov 21, 2023 at 04:29:36PM +0000, Parav Pandit wrote:
> > > Basic test with iperf is not working. It is crashing.
> > > All of this is completely unrelated discussion to this series, intended
> > to slow down the work.
> > > I don't see any value.
> > > Michael asked to do the test, we did, it does not work. Functionally
> > broken code cannot serve as a comparison.
> > 
> > It's unfortunate that it's unstable for you; if you could show a perf
> > comparison, that would be a strong argument for your case. Reporting
> > Linux/qemu failures to the virtio TC is not going to help you though,
> > wrong forum.
> 
> As I explained, the basic requirements are not met; hence the comparison is not applicable.
> There is no point in discussing a specific OS implementation anyway. You asked to remove the vfio citations, hence we removed the other citations as well.
> 
> Thanks.

So it crashes; and even if it does not crash, it does not perform well; and
even if it performs well, it does not meet the requirements.
Really, you are shooting yourself in the foot with stuff like this.
It does not look good at all.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  5:31                                                                 ` Jason Wang
@ 2023-11-23 13:19                                                                   ` Si-Wei Liu
  2023-11-23 14:39                                                                     ` Michael S. Tsirkin
  2023-11-24  2:29                                                                     ` Jason Wang
  0 siblings, 2 replies; 157+ messages in thread
From: Si-Wei Liu @ 2023-11-23 13:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Zhu, Lingshan, virtio-comment,
	cohuck, sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas,
	eperezma



On 11/21/2023 9:31 PM, Jason Wang wrote:
> On Wed, Nov 22, 2023 at 10:31 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> (dropping my personal email abandoned for upstream discussion for now,
>> please try to copy my corporate email address for more timely response)
>>
>> On 11/20/2023 10:55 PM, Jason Wang wrote:
>>> On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Sent: Friday, November 17, 2023 7:31 PM
>>>>> To: Parav Pandit <parav@nvidia.com>
>>>>>
>>>>> On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Sent: Friday, November 17, 2023 6:02 PM
>>>>>>>
>>>>>>> On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>> Sent: Friday, November 17, 2023 5:35 PM
>>>>>>>>> To: Parav Pandit <parav@nvidia.com>
>>>>>>>>>
>>>>>>>>> On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>> Sent: Friday, November 17, 2023 5:04 PM
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
>>>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>>> Sent: Friday, November 17, 2023 4:30 PM
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
>>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>> Sent: Friday, November 17, 2023 3:30 PM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
>>>>>>>>>>>>>>>> Lingshan
>>>>>>> wrote:
>>>>>>>>>>>>>>>>> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
>>>>>>>>>>>>>>>>>> Pandit
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> We should expose a limit of the device in the
>>>>>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>> WRITE_RECORD_CAP_QUERY command, that how much
>>>>> range
>>>>>>>>>>>>>>> it can
>>>>>>>>>>> track.
>>>>>>>>>>>>>>>>>>> So that future provisioning framework can use it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I will cover this in v5 early next week.
>>>>>>>>>>>>>>>>>> I do worry about how this can even work though.
>>>>>>>>>>>>>>>>>> If you want a generic device you do not get to
>>>>>>>>>>>>>>>>>> dictate how much memory VM
>>>>>>>>>>> has.
>>>>>>>>>>>>>>>>>> Aren't we talking bit per page? With 1TByte of
>>>>>>>>>>>>>>>>>> memory to track
>>>>>>>>>>>>>>>>>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And you happily say "we'll address this in the future"
>>>>>>>>>>>>>>>>>> while at the same time fighting tooth and nail
>>>>>>>>>>>>>>>>>> against adding single bit status registers because
>>>>> scalability?
>>>>>>>>>>>>>>>>>> I have a feeling doing this completely
>>>>>>>>>>>>>>>>>> theoretical like this is
>>>>>>>>>>> problematic.
>>>>>>>>>>>>>>>>>> Maybe you have it all laid out neatly in your
>>>>>>>>>>>>>>>>>> head but I suspect not all of TC can picture it
>>>>>>>>>>>>>>>>>> clearly enough based just on spec
>>>>>>>>>>> text.
>>>>>>>>>>>>>>>>>> We do sometimes ask for POC implementation in
>>>>>>>>>>>>>>>>>> linux / qemu to demonstrate how things work
>>>>>>>>>>>>>>>>>> before merging
>>>>>>> code.
>>>>>>>>>>>>>>>>>> We skipped this for admin things so far but I
>>>>>>>>>>>>>>>>>> think it's a good idea to start doing it here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What makes me pause a bit before saying please
>>>>>>>>>>>>>>>>>> do a PoC is all the opposition that seems to
>>>>>>>>>>>>>>>>>> exist to even using admin commands in the 1st
>>>>>>>>>>>>>>>>>> place. I think once we finally stop arguing
>>>>>>>>>>>>>>>>>> about whether to use admin commands at all then
>>>>>>>>>>>>>>>>>> a PoC will be needed
>>>>>>>>>>>>> before merging.
>>>>>>>>>>>>>>>>> We have POR productions that implemented the
>>>>>>>>>>>>>>>>> approach in my
>>>>>>>>>>> series.
>>>>>>>>>>>>>>>>> They are multiple generations of productions in
>>>>>>>>>>>>>>>>> market and running in customers data centers for years.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Back to 2019 when we start working on vDPA, we
>>>>>>>>>>>>>>>>> have sent some samples of production(e.g.,
>>>>>>>>>>>>>>>>> Cascade
>>>>>>>>>>>>>>>>> Glacier) and the datasheet, you can find live
>>>>>>>>>>>>>>>>> migration facilities there, includes suspend, vq
>>>>>>>>>>>>>>>>> state and other
>>>>>>> features.
>>>>>>>>>>>>>>>>> And there is an reference in DPDK live migration,
>>>>>>>>>>>>>>>>> I have provided this page
>>>>>>>>>>>>>>>>> before:
>>>>>>>>>>>>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.ht
>>>>>>>>>>>>>>>>> ml, it has been working for long long time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So if we let the facts speak, if we want to see
>>>>>>>>>>>>>>>>> if the proposal is proven to work, I would
>>>>>>>>>>>>>>>>> say: They are POR for years, customers already
>>>>>>>>>>>>>>>>> deployed them for
>>>>>>>>>>> years.
>>>>>>>>>>>>>>>> And I guess what you are trying to say is that
>>>>>>>>>>>>>>>> this patchset we are reviewing here should be help
>>>>>>>>>>>>>>>> to the same standard and there should be a PoC? Sounds
>>>>> reasonable.
>>>>>>>>>>>>>>> Yes and the in-marketing productions are POR, the
>>>>>>>>>>>>>>> series just improves the design, for example, our
>>>>>>>>>>>>>>> series also use registers to track vq state, but
>>>>>>>>>>>>>>> improvements than CG or BSC. So I think they are
>>>>>>>>>>>>>>> proven
>>>>>>>>>>>>> to work.
>>>>>>>>>>>>>> If you prefer to go the route of POR and production
>>>>>>>>>>>>>> and proven documents
>>>>>>>>>>>>> etc, there is ton of it of multiple types of products I
>>>>>>>>>>>>> can dump here with open- source code and documentation and
>>>>> more.
>>>>>>>>>>>>>> Let me know what you would like to see.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Michael has requested some performance comparisons,
>>>>>>>>>>>>>> not all are ready to
>>>>>>>>>>>>> share yet.
>>>>>>>>>>>>>> Some are present that I will share in coming weeks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And all the vdpa dpdk you published does not have
>>>>>>>>>>>>>> basic CVQ support when I
>>>>>>>>>>>>> last looked at it.
>>>>>>>>>>>>>> Do you know when was it added?
>>>>>>>>>>>>> It's good enough for PoC I think, CVQ or not.
>>>>>>>>>>>>> The problem with CVQ generally, is that VDPA wants to
>>>>>>>>>>>>> shadow CVQ it at all times because it wants to decode
>>>>>>>>>>>>> and cache the content. But this problem has nothing to
>>>>>>>>>>>>> do with dirty tracking even though it also
>>>>>>>>>>> mentions "shadow":
>>>>>>>>>>>>> if device can report it's state then there's no need to shadow
>>>>> CVQ.
>>>>>>>>>>>> For the performance numbers with the pre-copy and device
>>>>>>>>>>>> context of
>>>>>>>>>>> patches posted 1 to 5, the downtime reduction of the VM is
>>>>>>>>>>> 3.71x with active traffic on 8 RQs at 100Gbps port speed.
>>>>>>>>>>>
>>>>>>>>>>> Sounds good can you please post a bit more detail?
>>>>>>>>>>> which configs are you comparing what was the result on each of
>>>>> them.
>>>>>>>>>> Common config: 8+8 tx and rx queues.
>>>>>>>>>> Port speed: 100Gbps
>>>>>>>>>> QEMU 8.1
>>>>>>>>>> Libvirt 7.0
>>>>>>>>>> GVM: Centos 7.4
>>>>>>>>>> Device: virtio VF hardware device
>>>>>>>>>>
>>>>>>>>>> Config_1: virtio suspend/resume similar to what Lingshan has,
>>>>>>>>>> largely vdpa stack
>>>>>>>>>> Config_2: Device context method of admin commands
>>>>>>>>> OK that sounds good. The weird thing here is that you measure
>>>>> "downtime".
>>>>>>>>> What exactly do you mean here?
>>>>>>>>> I am guessing it's the time to retrieve on source and re-program
>>>>>>>>> device state on destination? And this is 3.71x out of how long?
>>>>>>>> Yes. Downtime is the time during which the VM is not responding or
>>>>>>>> receiving
>>>>>>> packets, which involves reprogramming the device.
>>>>>>>> 3.71x is relative time for this discussion.
>>>>>>> Oh interesting. So VM state movement including reprogramming the CPU
>>>>>>> is dominated by reprogramming this single NIC, by a factor of almost 4?
>>>>>> Yes.
>>>>> Could you post some numbers too then?  I want to know whether that would
>>>>> imply that VM boot is slowed down significantly too. If yes that's another
>>>>> motivation for pci transport 2.0.
>>>> It was 1.8 sec down to 480msec.
>>> Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
>>>
>>> Eugenio or Si-wei may share an exact number, but it should be several
>>> hundreds of ms.
>> That was mostly for device teardown time at the source, but there's
>> also setup cost at the destination that needs to be counted.
>> Several hundred milliseconds would be the ultimate goal, I would say
>> (right now the numbers from Parav more or less reflect the status quo,
>> but there's ongoing work to bring it further down), and I don't doubt
>> several hundreds of ms is possible. But to be fair, on the other hand,
>> shadow vq on a real vDPA hardware device would need a lot of dedicated
>> optimization work across all layers (including hardware or firmware) all
>> over the place to achieve what a simple suspend-resume (save/load)
>> interface can easily do with VFIO migration.
> That's fine. Just to clarify, shadow virtqueue here doesn't mean it
> can't save/load. We want to see how it is useful for dirty page
> tracking, since tracking dirty pages by the device itself seems problematic,
> at least from my point of view.
TBH I don't see how this comparison can help prove the problematic part
of device dirty tracking, or whether it has anything to do with it. In many
cases vDPA and hardware virtio are for different deployment scenarios
with varied target users; I don't see how vDPA can completely substitute
for hardware virtio, for many reasons, regardless of whether shadow
virtqueue wins or not.

If anything is relevant, I would rather like to see a performance comparison
with platform dirty tracking via IOMMUFD, but that's perhaps at too early a
stage at this point to conclude anything, given there's very limited
availability (in terms of supporting software; I know some supporting
hardware has been around for a few years) and none of the potential
software optimizations is in place at this point to make a fair
comparison. Granted, device-assisted tracking has its own set of
limitations, e.g. loose coupling or integration with platform features,
lack of nested and PASID support, et al. However, the state of the art for
platform dirty tracking is not perfect either, far from being highly
optimized for all types of workload or scenarios. At least to me, the
cost of the page table walk to scan all PTEs across all levels is not easily
negligible - given no PML equivalent here, are we sure the whole range
scan can be as efficient and scalable as memory size / # of PTEs grows?
How much might it impact the downtime with this rudimentary dirty scan?
No data point was given thus far. If chances are that there could be a
major improvement from device tracking for those general use cases, to
supplement what the platform cannot achieve efficiently enough, it's not too
good to kill off the possibility entirely at this early stage. Maybe a
PoC or some comparative performance data can help prove the theory?
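
(A rough back-of-the-envelope sketch of the sizes involved, using my own
assumed numbers - 4 KiB pages and 8-byte leaf PTEs - not anything measured
or taken from the spec or the IOMMUFD series:)

#include <stdio.h>

int main(void)
{
    unsigned long long guest_mem  = 1ULL << 40;            /* 1 TiB guest */
    unsigned long long page_size  = 4096;                   /* 4 KiB pages */
    unsigned long long pages      = guest_mem / page_size;  /* 256M pages */
    unsigned long long pte_bytes  = pages * 8;   /* ~2 GiB of leaf PTEs per full scan */
    unsigned long long bmap_bytes = pages / 8;   /* ~32 MiB dirty bitmap */

    printf("pages tracked: %llu\n", pages);
    printf("leaf PTE bytes walked per scan: %llu MiB\n", pte_bytes >> 20);
    printf("dirty bitmap size: %llu MiB\n", bmap_bytes >> 20);
    return 0;
}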

On the other hand, device-assisted tracking has at least one
advantage that the platform cannot simply offer - throttling the device down
for convergence, inherently or explicitly whenever needed. I think earlier
Michael suggested something to make the core data structure used for
logging more efficient and compact, working like PML but using a queue
or an array, where each entry may contain a list of discrete pages
or contiguous PFN ranges. On top of this one may add parallelism to
distribute load to multiple queues, or add zero copy to speed up dirty
sync to userspace - things virtio queues are pretty good at doing. After
all, nothing can be perfect to begin with, and every complex feature
would need substantial time to improve and evolve. That has been the case
for shadow virtqueue, from where it got started to where it is now, and even
so there's still a lot of optimization work not done yet. There must be head
room here for device page tracking or platform tracking, too.
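
For illustration only, a minimal sketch of what such a queue/array entry
could look like; the structure and field names are my own assumptions, not
the layout proposed in this series:

#include <stdint.h>

/* One dirty record: a contiguous run of dirtied guest page frames.
 * (Hypothetical layout, to illustrate the PML-like idea only.) */
struct dirty_pfn_range {
    uint64_t start_pfn;   /* first dirtied page frame number */
    uint32_t nr_pages;    /* length of the contiguous dirty run */
    uint32_t flags;       /* reserved; could later carry e.g. a PASID */
};

/* A log entry batches several ranges; entries could be spread across
 * multiple queues for parallelism, or mapped for zero-copy sync. */
#define DIRTY_RANGES_PER_ENTRY 32

struct dirty_log_entry {
    uint16_t nr_ranges;   /* number of valid ranges below */
    uint16_t reserved;
    struct dirty_pfn_range ranges[DIRTY_RANGES_PER_ENTRY];
};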

Regards,
-Siwei


>
> Shadow virtqueue can be used with a save/load model for device state
> recovery for sure.
>
>>> But it seems the shadow virtqueue itself is not the major factor but
>>> the time spent on programming vendor specific mappings for example.
>> Yep. The slowness in the mapping part is mostly due to the artifact of a
>> software-based implementation. IMHO, from a live migration p.o.v. it's better
>> to not involve any mapping operation in the downtime path at all.
> Yes.
>
> Thanks
>
>> -Siwei
>>> Thanks
>>>
>>>> The time didn't come from pci side or boot side.
>>>>
>>>> For pci side of things you would want to compare the pci vs non pci device based VM boot time.
>>>>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-23 13:19                                                                   ` Si-Wei Liu
@ 2023-11-23 14:39                                                                     ` Michael S. Tsirkin
  2023-11-24  2:29                                                                     ` Jason Wang
  1 sibling, 0 replies; 157+ messages in thread
From: Michael S. Tsirkin @ 2023-11-23 14:39 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Jason Wang, Parav Pandit, Zhu, Lingshan, virtio-comment, cohuck,
	sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas, eperezma

On Thu, Nov 23, 2023 at 05:19:06AM -0800, Si-Wei Liu wrote:
> If anything relevant I would more like to see performance comparison with
> platform dirty tracking via IOMMUFD, but that's perhaps too early stage at
> this point to conclude anything given there's very limited availability (in
> terms of supporting software, I know some supporting hardware has been
> around for a few years) and none of the potential software optimizations is
> in place at this point to make a fair comparison for.


Exactly. I suggested shadow as a kind of fallback since it, by contrast, is
available.

> Granted device
> assisted tracking has its own set of limitations e.g. loose coupling or
> integration with platform features, lack of nested and PASID support et al.
> However, state of the art for platform dirty tracking is not perfect either,
> far off being highly optimized for all types of workload or scenarios. At
> least to me the cost of page table walk to scan all PTEs across all levels
> is not easily negligible - given no PML equivalent here, are we sure the
> whole range scan can be as efficient and scalable as memory size / # of PTEs
> grows? How large it may impact the downtime with this rudimentary dirty
> scan? No data point was given thus far. If chances are that there could be
> major improvement from device tracking for those general use cases to
> supplement what platform cannot achieve efficiently enough, it's not too
> good to kill off the possibility entirely at this early stage. Maybe a PoC
> or some comparative performance data can help prove the theory?
> 
> On the other hand, the device assisted tracking has at least one advantage
> that platform cannot simply offer - throttle down device for convergence,
> inherently or explicitly whenever needed. I think earlier Michael suggested
> something to make the core data structure used for logging more efficient
> and compact, working like PML but using a queue or an array, and the entry
> of which may contain a list of discrete pages or contiguous PFN ranges.

Hmm no, what I really meant is a way for the device to store all or part
of this structure in host RAM, as opposed to all of it in on-device
memory as it has to be now.
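
Something along these lines, purely as a hypothetical sketch (the structure
and field names are invented here; this is not an existing admin command
layout):

#include <stdint.h>

/* Hypothetical descriptor the driver hands to the device so the dirty
 * bitmap lives in host RAM rather than in on-device memory. */
struct write_record_host_area {
    uint64_t bitmap_addr;   /* physical address of the bitmap in host RAM */
    uint64_t bitmap_len;    /* bitmap size in bytes */
    uint64_t range_start;   /* first byte of the tracked address range */
    uint64_t range_len;     /* length of the tracked address range */
    uint32_t page_shift;    /* tracking granularity, e.g. 12 for 4 KiB */
    uint32_t reserved;
};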

> On
> top of this one may add parallelism to distribute load to multiple queues,
> or add zero copy to speed up dirty sync to userspace - things virtio queues
> are pretty good at doing. After all, nothing can be perfect to begin with,
> and every complex feature would need substantial time to improve and evolve.
> It does so for shadow virtqueue from where it gets started to where it is
> now, even so there's still a lot of optimization work not done yet. There
> must be head room here for device page tracking or platform tracking, too.
> 
> Regards,
> -Siwei
> 
> 
> > 
> > Shadow virtqueue can be used with a save/load model for device state
> > recovery for sure.
> > 
> > > > But it seems the shadow virtqueue itself is not the major factor but
> > > > the time spent on programming vendor specific mappings for example.
> > > Yep. The slowness on mapping part is mostly due to the artifact of
> > > software-based implementation. IMHO for live migration p.o.v it's better
> > > to not involve any mapping operation in the down time path at all.
> > Yes.
> > 
> > Thanks
> > 
> > > -Siwei
> > > > Thanks
> > > > 
> > > > > The time didn't come from pci side or boot side.
> > > > > 
> > > > > For pci side of things you would want to compare the pci vs non pci device based VM boot time.
> > > > > 


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-23 13:19                                                                   ` Si-Wei Liu
  2023-11-23 14:39                                                                     ` Michael S. Tsirkin
@ 2023-11-24  2:29                                                                     ` Jason Wang
  2023-11-28  3:00                                                                       ` Si-Wei Liu
  1 sibling, 1 reply; 157+ messages in thread
From: Jason Wang @ 2023-11-24  2:29 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Parav Pandit, Michael S. Tsirkin, Zhu, Lingshan, virtio-comment,
	cohuck, sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas,
	eperezma

On Thu, Nov 23, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 11/21/2023 9:31 PM, Jason Wang wrote:
> > On Wed, Nov 22, 2023 at 10:31 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >> (dropping my personal email abandoned for upstream discussion for now,
> >> please try to copy my corporate email address for more timely response)
> >>
> >> On 11/20/2023 10:55 PM, Jason Wang wrote:
> >>> On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
> >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>> Sent: Friday, November 17, 2023 7:31 PM
> >>>>> To: Parav Pandit <parav@nvidia.com>
> >>>>>
> >>>>> On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>> Sent: Friday, November 17, 2023 6:02 PM
> >>>>>>>
> >>>>>>> On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> >>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>> Sent: Friday, November 17, 2023 5:35 PM
> >>>>>>>>> To: Parav Pandit <parav@nvidia.com>
> >>>>>>>>>
> >>>>>>>>> On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> >>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>> Sent: Friday, November 17, 2023 5:04 PM
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> >>>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>> Sent: Friday, November 17, 2023 4:30 PM
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> >>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>>> Sent: Friday, November 17, 2023 3:30 PM
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
> >>>>>>>>>>>>>>>> Lingshan
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
> >>>>>>>>>>>>>>>>>> Pandit
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>> We should expose a limit of the device in the
> >>>>>>>>>>>>>>>>>>> proposed
> >>>>>>>>>>>>>>> WRITE_RECORD_CAP_QUERY command, that how much
> >>>>> range
> >>>>>>>>>>>>>>> it can
> >>>>>>>>>>> track.
> >>>>>>>>>>>>>>>>>>> So that future provisioning framework can use it.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I will cover this in v5 early next week.
> >>>>>>>>>>>>>>>>>> I do worry about how this can even work though.
> >>>>>>>>>>>>>>>>>> If you want a generic device you do not get to
> >>>>>>>>>>>>>>>>>> dictate how much memory VM
> >>>>>>>>>>> has.
> >>>>>>>>>>>>>>>>>> Aren't we talking bit per page? With 1TByte of
> >>>>>>>>>>>>>>>>>> memory to track
> >>>>>>>>>>>>>>>>>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> And you happily say "we'll address this in the future"
> >>>>>>>>>>>>>>>>>> while at the same time fighting tooth and nail
> >>>>>>>>>>>>>>>>>> against adding single bit status registers because
> >>>>> scalability?
> >>>>>>>>>>>>>>>>>> I have a feeling doing this completely
> >>>>>>>>>>>>>>>>>> theoretical like this is
> >>>>>>>>>>> problematic.
> >>>>>>>>>>>>>>>>>> Maybe you have it all laid out neatly in your
> >>>>>>>>>>>>>>>>>> head but I suspect not all of TC can picture it
> >>>>>>>>>>>>>>>>>> clearly enough based just on spec
> >>>>>>>>>>> text.
> >>>>>>>>>>>>>>>>>> We do sometimes ask for POC implementation in
> >>>>>>>>>>>>>>>>>> linux / qemu to demonstrate how things work
> >>>>>>>>>>>>>>>>>> before merging
> >>>>>>> code.
> >>>>>>>>>>>>>>>>>> We skipped this for admin things so far but I
> >>>>>>>>>>>>>>>>>> think it's a good idea to start doing it here.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> What makes me pause a bit before saying please
> >>>>>>>>>>>>>>>>>> do a PoC is all the opposition that seems to
> >>>>>>>>>>>>>>>>>> exist to even using admin commands in the 1st
> >>>>>>>>>>>>>>>>>> place. I think once we finally stop arguing
> >>>>>>>>>>>>>>>>>> about whether to use admin commands at all then
> >>>>>>>>>>>>>>>>>> a PoC will be needed
> >>>>>>>>>>>>> before merging.
> >>>>>>>>>>>>>>>>> We have POR productions that implemented the
> >>>>>>>>>>>>>>>>> approach in my
> >>>>>>>>>>> series.
> >>>>>>>>>>>>>>>>> They are multiple generations of productions in
> >>>>>>>>>>>>>>>>> market and running in customers data centers for years.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Back to 2019 when we start working on vDPA, we
> >>>>>>>>>>>>>>>>> have sent some samples of production(e.g.,
> >>>>>>>>>>>>>>>>> Cascade
> >>>>>>>>>>>>>>>>> Glacier) and the datasheet, you can find live
> >>>>>>>>>>>>>>>>> migration facilities there, includes suspend, vq
> >>>>>>>>>>>>>>>>> state and other
> >>>>>>> features.
> >>>>>>>>>>>>>>>>> And there is an reference in DPDK live migration,
> >>>>>>>>>>>>>>>>> I have provided this page
> >>>>>>>>>>>>>>>>> before:
> >>>>>>>>>>>>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.ht
> >>>>>>>>>>>>>>>>> ml, it has been working for long long time.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> So if we let the facts speak, if we want to see
> >>>>>>>>>>>>>>>>> if the proposal is proven to work, I would
> >>>>>>>>>>>>>>>>> say: They are POR for years, customers already
> >>>>>>>>>>>>>>>>> deployed them for
> >>>>>>>>>>> years.
> >>>>>>>>>>>>>>>> And I guess what you are trying to say is that
> >>>>>>>>>>>>>>>> this patchset we are reviewing here should be help
> >>>>>>>>>>>>>>>> to the same standard and there should be a PoC? Sounds
> >>>>> reasonable.
> >>>>>>>>>>>>>>> Yes and the in-marketing productions are POR, the
> >>>>>>>>>>>>>>> series just improves the design, for example, our
> >>>>>>>>>>>>>>> series also use registers to track vq state, but
> >>>>>>>>>>>>>>> improvements than CG or BSC. So I think they are
> >>>>>>>>>>>>>>> proven
> >>>>>>>>>>>>> to work.
> >>>>>>>>>>>>>> If you prefer to go the route of POR and production
> >>>>>>>>>>>>>> and proven documents
> >>>>>>>>>>>>> etc, there is ton of it of multiple types of products I
> >>>>>>>>>>>>> can dump here with open- source code and documentation and
> >>>>> more.
> >>>>>>>>>>>>>> Let me know what you would like to see.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Michael has requested some performance comparisons,
> >>>>>>>>>>>>>> not all are ready to
> >>>>>>>>>>>>> share yet.
> >>>>>>>>>>>>>> Some are present that I will share in coming weeks.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> And all the vdpa dpdk you published does not have
> >>>>>>>>>>>>>> basic CVQ support when I
> >>>>>>>>>>>>> last looked at it.
> >>>>>>>>>>>>>> Do you know when was it added?
> >>>>>>>>>>>>> It's good enough for PoC I think, CVQ or not.
> >>>>>>>>>>>>> The problem with CVQ generally, is that VDPA wants to
> >>>>>>>>>>>>> shadow CVQ it at all times because it wants to decode
> >>>>>>>>>>>>> and cache the content. But this problem has nothing to
> >>>>>>>>>>>>> do with dirty tracking even though it also
> >>>>>>>>>>> mentions "shadow":
> >>>>>>>>>>>>> if device can report it's state then there's no need to shadow
> >>>>> CVQ.
> >>>>>>>>>>>> For the performance numbers with the pre-copy and device
> >>>>>>>>>>>> context of
> >>>>>>>>>>> patches posted 1 to 5, the downtime reduction of the VM is
> >>>>>>>>>>> 3.71x with active traffic on 8 RQs at 100Gbps port speed.
> >>>>>>>>>>>
> >>>>>>>>>>> Sounds good can you please post a bit more detail?
> >>>>>>>>>>> which configs are you comparing what was the result on each of
> >>>>> them.
> >>>>>>>>>> Common config: 8+8 tx and rx queues.
> >>>>>>>>>> Port speed: 100Gbps
> >>>>>>>>>> QEMU 8.1
> >>>>>>>>>> Libvirt 7.0
> >>>>>>>>>> GVM: Centos 7.4
> >>>>>>>>>> Device: virtio VF hardware device
> >>>>>>>>>>
> >>>>>>>>>> Config_1: virtio suspend/resume similar to what Lingshan has,
> >>>>>>>>>> largely vdpa stack
> >>>>>>>>>> Config_2: Device context method of admin commands
> >>>>>>>>> OK that sounds good. The weird thing here is that you measure
> >>>>> "downtime".
> >>>>>>>>> What exactly do you mean here?
> >>>>>>>>> I am guessing it's the time to retrieve on source and re-program
> >>>>>>>>> device state on destination? And this is 3.71x out of how long?
> >>>>>>>> Yes. Downtime is the time during which the VM is not responding or
> >>>>>>>> receiving
> >>>>>>> packets, which involves reprogramming the device.
> >>>>>>>> 3.71x is relative time for this discussion.
> >>>>>>> Oh interesting. So VM state movement including reprogramming the CPU
> >>>>>>> is dominated by reprogramming this single NIC, by a factor of almost 4?
> >>>>>> Yes.
> >>>>> Could you post some numbers too then?  I want to know whether that would
> >>>>> imply that VM boot is slowed down significantly too. If yes that's another
> >>>>> motivation for pci transport 2.0.
> >>>> It was 1.8 sec down to 480msec.
> >>> Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
> >>>
> >>> Eugenio or Si-wei may share an exact number, but it should be several
> >>> hundreds of ms.
> >> That was mostly for device teardown time at the the source but there's
> >> also setup cost at the destination that needs to be counted.
> >> Several hundred of milliseconds would be the ultimate goal I would say
> >> (right now the numbers from Parav more or less reflects the status quo
> >> but there's ongoing work to make it further down), and I don't doubt
> >> several hundreds of ms is possible. But to be fair, on the other hand,
> >> shadow vq on real vdpa hardware device would need a lot of dedicated
> >> optimization work across all layers (including hardware or firmware) all
> >> over the places to achieve what a simple suspend-resume (save/load)
> >> interface can easily do with VFIO migration.
> > That's fine. Just to clairfy, shadow virtqueue here doesn't mean it
> > can't save/load. We want to see how it is useful for dirty page
> > tracking since tracking dirty pages by device itself seems problematic
> > at least from my point of view.
> TBH I don't see how this comparison can help prove the problematic part
> of device dirty tracking, or if it has anything to do with.

Shadow virtqueue is not used to prove the problem; the problem could
be uncovered during the review.

The shadow virtqueue is used to give us a bottom line. If a huge
effort is spent on the spec but it can't perform better than shadow
virtqueue, the effort becomes meaningless.

> In many
> cases vDPA and hardware virtio are for different deployment scenarios
> with varied target users, I don't see how vDPA can completely substitute
> hardware virtio for many reasons regardless shadow virtqueue wins or not.

It's not about whether vDPA can win or not. It's about a quick
demonstration of how shadow virtqueue can perform. From the point of view
of the shadow virtqueue, it doesn't know whether the underlying layer is
vDPA or virtio. It's not hard to imagine that the downtime we get from
vDPA is the lower bound of the downtime via virtio, since virtio is much
easier.

>
> If anything relevant I would more like to see performance comparison
> with platform dirty tracking via IOMMUFD, but that's perhaps too early
> stage at this point to conclude anything given there's very limited
> availability (in terms of supporting software, I know some supporting
> hardware has been around for a few years) and none of the potential
> software optimizations is in place at this point to make a fair
> comparison for.

We need to make sure the function is correct before we can
talk about optimizations. And, in many respects, I don't see how this
proposal is optimized.

> Granted device assisted tracking has its own set of
> limitations e.g. loose coupling or integration with platform features,
> lack of nested and PASID support et al. However, state of the art for
> platform dirty tracking is not perfect either, far off being highly
> optimized for all types of workload or scenarios. At least to me the
> cost of page table walk to scan all PTEs across all levels is not easily
> negligible - given no PML equivalent here, are we sure the whole range
> scan can be as efficient and scalable as memory size / # of PTEs grows?

If you look at the discussion, this proposal requires scanning PTEs as well
in many ways.

> How large it may impact the downtime with this rudimentary dirty scan?
> No data point was given thus far. If chances are that there could be
> major improvement from device tracking for those general use cases to
> supplement what platform cannot achieve efficiently enough, it's not too
> good to kill off the possibility entirely at this early stage. Maybe a
> PoC or some comparative performance data can help prove the theory?

We can ask in the thread of IOMMUFD dirty tracking patches.

>
> On the other hand, the device assisted tracking has at least one
> advantage that platform cannot simply offer - throttle down device for
> convergence, inherently or explicitly whenever needed.

Please refer to the past discussion. I can see how throttling works in the
case of a PML-like mechanism, but I can't see how it can be done
here. This proposal requires the device to reserve sufficient
resources, and the throttling is implementation specific, which the
hypervisor can't depend on. It needs an API to set dirty page rates at
least.
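
For example, something like the following (hypothetical and invented for
illustration; it is not part of the proposal under review) is the kind of
knob the hypervisor would need before it could depend on device-side
throttling:

#include <stdint.h>

/* Hypothetical admin command payload to cap the rate at which the
 * device may dirty guest memory while write recording is enabled. */
struct virtio_admin_cmd_dirty_rate_set {
    uint64_t max_dirty_pages_per_sec;   /* 0 = no limit */
    uint64_t max_dirty_bytes_per_sec;   /* alternative byte-based cap; 0 = no limit */
};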

>I think earlier
> Michael suggested something to make the core data structure used for
> logging more efficient and compact, working like PML but using a queue
> or an array, and the entry of which may contain a list of discrete pages
> or contiguous PFN ranges.

PML solves the resource problem but not the other problems:

1) Throttling: it's still not something that the hypervisor can depend on.
The reason why PML on the CPU works is that the hypervisor can throttle the
KVM process so it slows down to the expected dirty rates.
2) Platform specific issues: PASID, ATS, translation failures, reserved
regions, and a lot of other stuff
3) vIOMMU issue: horrible delay in the IOTLB invalidation path
4) Doesn't work in the case of vIOMMU offloading

And compared to the existing approach, it ends up with more PCI
transactions under heavy load.

> On top of this one may add parallelism to
> distribute load to multiple queues, or add zero copy to speed up dirty
> sync to userspace - things virtio queues are pretty good at doing. After
> all, nothing can be perfect to begin with, and every complex feature
> would need substantial time to improve and evolve.

Evolving is good, but the problem is that the platform is also evolving. The
function is duplicated there, and the platform provides a lot of advanced
features that can cooperate with dirty page tracking, like vIOMMU
offloading, which is almost impossible to do in virtio. Virtio
needs to leverage the platform or transport instead of reinventing
wheels, so it can focus on the virtio device logic.

> It does so for shadow
> virtqueue from where it gets started to where it is now, even so there's
> still a lot of optimization work not done yet. There must be head room
> here for device page tracking or platform tracking, too.

Let's then focus on the possible issues (I've pointed out a bunch).

Thanks

>
> Regards,
> -Siwei
>
>
> >
> > Shadow virtqueue can be used with a save/load model for device state
> > recovery for sure.
> >
> >>> But it seems the shadow virtqueue itself is not the major factor but
> >>> the time spent on programming vendor specific mappings for example.
> >> Yep. The slowness on mapping part is mostly due to the artifact of
> >> software-based implementation. IMHO for live migration p.o.v it's better
> >> to not involve any mapping operation in the down time path at all.
> > Yes.
> >
> > Thanks
> >
> >> -Siwei
> >>> Thanks
> >>>
> >>>> The time didn't come from pci side or boot side.
> >>>>
> >>>> For pci side of things you would want to compare the pci vs non pci device based VM boot time.
> >>>>
>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  6:41                                     ` [virtio-comment] " Parav Pandit
@ 2023-11-24  3:06                                       ` Jason Wang
  0 siblings, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-24  3:06 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 22, 2023 at 2:41 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 9:59 AM
> >
> > On Wed, Nov 22, 2023 at 12:31 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 12:45 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 9:54 AM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Monday, November 13, 2023 9:07 AM
> > > > > > > >
> > > > > > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Tuesday, November 7, 2023 9:34 AM
> > > > > > > > > >
> > > > > > > > > > On Mon, Nov 6, 2023 at 2:54 PM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Monday, November 6, 2023 12:04 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > Sent: Thursday, November 2, 2023 9:54 AM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Nov 1, 2023 at 11:02 AM Parav Pandit
> > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:00 AM
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Oct 31, 2023 at 11:27 AM Parav
> > > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:13 AM
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, Oct 30, 2023 at 9:21 PM Parav
> > > > > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > During a device migration flow
> > > > > > > > > > > > > > > > > > > (typically in a precopy phase of the
> > > > > > > > > > > > > > > > > > > live migration), a device may write to the guest
> > memory.
> > > > > > > > > > > > > > > > > > > Some iommu/hypervisor may not be able
> > > > > > > > > > > > > > > > > > > to track these
> > > > > > > > > > > > > > written pages.
> > > > > > > > > > > > > > > > > > > These pages to be migrated from source
> > > > > > > > > > > > > > > > > > > to destination
> > > > > > > > > > hypervisor.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > A device which writes to these pages,
> > > > > > > > > > > > > > > > > > > provides the page address record of
> > > > > > > > > > > > > > > > > > > the writes to the owner
> > > > device.
> > > > > > > > > > > > > > > > > > > The owner device starts write
> > > > > > > > > > > > > > > > > > > recording for the device and queries
> > > > > > > > > > > > > > > > > > > all the page addresses written by the
> > > > > > > > device.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Fixes:
> > > > > > > > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-sp
> > > > > > > > > > > > > > > > > > > ec/i
> > > > > > > > > > > > > > > > > > > ssue
> > > > > > > > > > > > > > > > > > > s/17
> > > > > > > > > > > > > > > > > > > 6
> > > > > > > > > > > > > > > > > > > Signed-off-by: Parav Pandit
> > > > > > > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > > > > > > > > > > Signed-off-by: Satananda Burla
> > > > > > > > > > > > > > > > > > > <sburla@marvell.com>
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > changelog:
> > > > > > > > > > > > > > > > > > > v1->v2:
> > > > > > > > > > > > > > > > > > > - addressed comments from Michael
> > > > > > > > > > > > > > > > > > > - replaced iova with physical address
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > >  admin-cmds-device-migration.tex | 15
> > > > > > > > > > > > > > > > > > > +++++++++++++++
> > > > > > > > > > > > > > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > > a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > > > b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > > > index ed911e4..2e32f2c
> > > > > > > > > > > > > > > > > > > 100644
> > > > > > > > > > > > > > > > > > > --- a/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > > > +++ b/admin-cmds-device-migration.tex
> > > > > > > > > > > > > > > > > > > @@ -95,6 +95,21 @@
> > > > > > > > > > > > > > > > > > > \subsubsection{Device
> > > > > > > > > > > > > > > > > > > Migration}\label{sec:Basic Facilities
> > > > > > > > > > > > > > > > > > > of a Virtio Device / The owner driver
> > > > > > > > > > > > > > > > > > > can discard any partially read or
> > > > > > > > > > > > > > > > > > > written device context when  any of
> > > > > > > > > > > > > > > > > > > the device migration flow
> > > > > > > > > > > > > > > > > > should be aborted.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > +During the device migration flow, a
> > > > > > > > > > > > > > > > > > > +passthrough device may write data to
> > > > > > > > > > > > > > > > > > > +the guest virtual machine's memory, a
> > > > > > > > > > > > > > > > > > > +source hypervisor needs to keep track
> > > > > > > > > > > > > > > > > > > +of these written memory to migrate
> > > > > > > > > > > > > > > > > > > +such memory to destination
> > > > > > > > > > > > > > > > > > hypervisor.
> > > > > > > > > > > > > > > > > > > +Some systems may not be able to keep
> > > > > > > > > > > > > > > > > > > +track of such memory write addresses
> > > > > > > > > > > > > > > > > > > +at hypervisor
> > > > level.
> > > > > > > > > > > > > > > > > > > +In such a scenario, a device records
> > > > > > > > > > > > > > > > > > > +and reports these written memory
> > > > > > > > > > > > > > > > > > > +addresses to the owner device. The
> > > > > > > > > > > > > > > > > > > +owner driver enables write recording
> > > > > > > > > > > > > > > > > > > +for one or more physical address
> > > > > > > > > > > > > > > > > > > +ranges per device during device
> > > > > > > > > > > > > > migration flow.
> > > > > > > > > > > > > > > > > > > +The owner driver periodically queries
> > > > > > > > > > > > > > > > > > > +these written physical address
> > > > > > > > > > > > > > > > records from the device.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I wonder how PA works in this case.
> > > > > > > > > > > > > > > > > > Device uses untranslated requests so it can only see
> > IOVA.
> > > > > > > > > > > > > > > > > > We can't mandate
> > > > > > > > > > ATS anyhow.
> > > > > > > > > > > > > > > > > Michael suggested to keep the language
> > > > > > > > > > > > > > > > > uniform as PA as this is ultimately
> > > > > > > > > > > > > > > > what the guest driver is supplying during vq
> > > > > > > > > > > > > > > > creation and in posting buffers as physical address.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This seems to need some work. And, can you
> > > > > > > > > > > > > > > > show me how it can
> > > > > > > > > > work?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1) e.g if GAW is 48 bit, is the hypervisor
> > > > > > > > > > > > > > > > expected to do a bisection of the whole range?
> > > > > > > > > > > > > > > > 2) does the device need to reserve
> > > > > > > > > > > > > > > > sufficient internal resources for logging
> > > > > > > > > > > > > > > > the dirty page and why
> > > > (not)?
> > > > > > > > > > > > > > > No when dirty page logging starts, only at
> > > > > > > > > > > > > > > that time, device will reserve
> > > > > > > > > > > > > > enough resources.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > GAW is 48bit, how large would it have then?
> > > > > > > > > > > > > Dirty page tracking is not dependent on the size of the GAW.
> > > > > > > > > > > > > It is function of address ranges for the amount of
> > > > > > > > > > > > > guest memory regardless of
> > > > > > > > > > > > GAW.
> > > > > > > > > > > >
> > > > > > > > > > > > The problem is, e.g when vIOMMU is enabled, you
> > > > > > > > > > > > can't know which IOVA is actually used by guests.
> > > > > > > > > > > > And even for the case when vIOMMU is not enabled,
> > > > > > > > > > > > the guest may have
> > > > several TBs.
> > > > > > > > > > > > Is it easy to reserve sufficient resources by the device itself?
> > > > > > > > > > > >
> > > > > > > > > > > When page tracking is enabled per device, it knows
> > > > > > > > > > > about the range and it can
> > > > > > > > > > reserve certain resource.
> > > > > > > > > >
> > > > > > > > > > I didn't see such an interface in this series. Anything I miss?
> > > > > > > > > >
> > > > > > > > > Yes, this patch and the next patch is covering the page
> > > > > > > > > tracking start,stop and
> > > > > > > > query commands.
> > > > > > > > > They are named as write recording commands.
> > > > > > > >
> > > > > > > > So I still don't see how the device can reserve sufficient resources?
> > > > > > > > Guests may map a very large area of memory to IOMMU (or when
> > > > > > > > vIOMMU is disabled, GPA is used). It would be several TBs,
> > > > > > > > how can the device reserve sufficient resources in this case?
> > > > > > > When the map is established, the ranges are supplied to the
> > > > > > > device to know
> > > > > > how much to reserve.
> > > > > > > If device does not have enough resource, it fails the command.
> > > > > > >
> > > > > > > One can advance it further to provision for the desired range..
> > > > > >
> > > > > > Well, I think I've asked whether or not a bisection is needed,
> > > > > > and you told me not ...
> > > > > >
> > > > > > But at least we need to document this in the proposal, no?
> > > > > >
> > > > > We should expose a limit of the device in the proposed
> > > > WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > > > So that future provisioning framework can use it.
> > > > >
> > > > > I will cover this in v5 early next week.
> > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > Btw, the IOVA is allocated by the guest actually, how
> > > > > > > > > > can we know the
> > > > > > > > range?
> > > > > > > > > > (or using the host range?)
> > > > > > > > > >
> > > > > > > > > Hypervisor would have mapping translation.
> > > > > > > >
> > > > > > > > That's really tricky and can only work in some cases:
> > > > > > > >
> > > > > > > > 1) It requires the hypervisor to traverse the guest I/O page
> > > > > > > > tables which could be very large range
> > > > > > > > 2) It requests the hypervisor to trap the modification of
> > > > > > > > guest I/O page tables and synchronize with the range
> > > > > > > > changes, which is inefficient and can only be done when we are
> > doing shadow PTEs.
> > > > > > > > It won't work when the nesting translation could be
> > > > > > > > offloaded to the hardware
> > > > > > > > 3) It is racy with the guest modification of I/O page tables
> > > > > > > > which is explained in another thread
> > > > > > > Mapping changes with more hw mmu's is not a frequent event and
> > > > > > > IOTLB
> > > > > > flush is done using querying the dirty log for the smaller range.
> > > > > > >
> > > > > > > > 4) No aware of new features like PASID which has been
> > > > > > > > explained in another thread
> > > > > > > For all the pinned work with non sw based IOMMU, it is
> > > > > > > typically small
> > > > subset.
> > > > > > > PASID is guest controlled.
> > > > > >
> > > > > > Let's repeat my points:
> > > > > >
> > > > > > 1) vq1 use untranslated request with PASID1
> > > > > > 2) vq2 use untranslated request with PASID2
> > > > > >
> > > > > > Shouldn't we log PASID as well?
> > > > > >
> > > > > Possibly yes, either to request the tracking per PASID or to log the PASID.
> > > > > When in future PASID based VQ are supported, this part should be
> > > > extended.
> > > >
> > > > Who is going to do the extension? They are orthogonal features for sure.
> > > Whoever extends the VQ for PASID programming.
> > >
> > > I plan to have generic command for VQ creation over CVQ
> >
> > Another unrelated issue.
> I disagree.
>
> >
> > > for the wider use cases we discussed.
> >
> > CVQ might want a dedicated PASID.
> Why? For one off queue like that may be additional register because this is still bootstrap phase.

For many reasons. The hypervisor may want to control the CVQ.

> But using that as argument point to generalize for rest of the queue is wrong.
>
> >
> > > It can have PASID parameter in future when one wants to add it.
> > >
> > > >
> > > > >
> > > > > > And
> > > > > >
> > > > > > 1) vq1 is using translated request
> > > > > > 2) vq2 is using untranslated request
> > > > > >
> > > >
> > > > How about this?
> > > How did driver program the device for vq1 to translated request and vq2 to
> > not.
> > > And for which use case?
> >
> > Again, it is allowed by the PCI spec, no? You've explained yourself that your
> > design needs to obey PCI spec.
> >
> How did the guest driver program this in the device?

It doesn't. When ATS is enabled, the device can use untranslated
requests as well as translated requests. No?

>
> > And, if you want to ask. for use case, there are handy:
> >
> > - ATS
> > - When IOMMU_PLATFORM is not negotiated
> > - MSI
> >
> So why and how driver did it differently for two vqs?

I hope this is your serious answer but it looks like it is not. It has
nothing to do with the driver.

Vq1 is doing DMA
Vq2 is doing MSI-X

>
> > Let's make sure the function of your proposal is correct before talking about
> > any use cases.
> This proposal has nothing to do with vqs.

See above.

> It is simply that tracking does not involve PASID at the moment, and it can be added in future.
>
> >
> > >
> > > >
> > > > >
> > > > > > How could we differ?
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Host should always have more resources than device,
> > > > > > > > > > > > in that sense there could be several methods that
> > > > > > > > > > > > tries to utilize host memory instead of the one in
> > > > > > > > > > > > the device. I think we've discussed this when going
> > > > > > > > > > > > through the doc prepared
> > > > by Eugenio.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > What happens if we're trying to migrate more than 1
> > device?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > That is perfectly fine.
> > > > > > > > > > > > > Each device is updating its log of pages it wrote.
> > > > > > > > > > > > > The hypervisor is collecting their sum.
> > > > > > > > > > > >
> > > > > > > > > > > > See above.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3) DMA is part of the transport, it's
> > > > > > > > > > > > > > > > natural to do logging there, why duplicate efforts in the
> > virtio layer?
> > > > > > > > > > > > > > > He he, you have funny comment.
> > > > > > > > > > > > > > > When an abstract facility is added to virtio
> > > > > > > > > > > > > > > you say to do in
> > > > > > > > transport.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So it's not done in the general facility but
> > > > > > > > > > > > > > tied to the admin
> > > > part.
> > > > > > > > > > > > > > And we all know dirty page tracking is a
> > > > > > > > > > > > > > challenge and Eugenio has a good summary of
> > > > > > > > > > > > > > pros/cons. A revisit of those docs make me think
> > > > > > > > > > > > > > virtio is not the good place for doing that for
> > > > > > > > may reasons:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1) as stated, platform will evolve to be able to
> > > > > > > > > > > > > > tracking dirty pages, actually, it has been
> > > > > > > > > > > > > > supported by a lot of major IOMMU vendors
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is optional facility in virtio.
> > > > > > > > > > > > > Can you please point to the references? I don’t
> > > > > > > > > > > > > see it in the common Linux
> > > > > > > > > > > > kernel support for it.
> > > > > > > > > > > >
> > > > > > > > > > > > Note that when IOMMUFD is being proposed, dirty page
> > > > > > > > > > > > tracking is one of the major considerations.
> > > > > > > > > > > >
> > > > > > > > > > > > This is one recent proposal:
> > > > > > > > > > > >
> > > > > > > > > > > > https://www.spinics.net/lists/kvm/msg330894.html
> > > > > > > > > > > >
> > > > > > > > > > > Sure, so if platform supports it. it can be used from the
> > platform.
> > > > > > > > > > > If it does not, the device supplies it.
> > > > > > > > > > >
> > > > > > > > > > > > > Instead Linux kernel choose to extend to the devices.
> > > > > > > > > > > >
> > > > > > > > > > > > Well, as I stated, tracking dirty pages is
> > > > > > > > > > > > challenging if you want to do it on a device, and
> > > > > > > > > > > > you can't simply invent dirty page tracking for each type of
> > the devices.
> > > > > > > > > > > >
> > > > > > > > > > > It is not invented.
> > > > > > > > > > > It is generic framework for all virtio device types as proposed
> > here.
> > > > > > > > > > > Keep in mind, that it is optional already in v3 series.
> > > > > > > > > > >
> > > > > > > > > > > > > At least not seen to arrive this in any near term
> > > > > > > > > > > > > in start of
> > > > > > > > > > > > > 2024 which is
> > > > > > > > > > > > where users must use this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 2) you can't assume virtio is the only device
> > > > > > > > > > > > > > that can be used by the guest, having dirty
> > > > > > > > > > > > > > pages tracking to be implemented in each type of
> > > > > > > > > > > > > > device is unrealistic
> > > > > > > > > > > > > Of course, there is no such assumption made. Where
> > > > > > > > > > > > > did you see a text that
> > > > > > > > > > > > made such assumption?
> > > > > > > > > > > >
> > > > > > > > > > > > So what happens if you have a guest with virtio and
> > > > > > > > > > > > other devices
> > > > > > > > assigned?
> > > > > > > > > > > >
> > > > > > > > > > > What happens? Each device type would do its own dirty
> > > > > > > > > > > page
> > > > tracking.
> > > > > > > > > > > And if all devices does not have support, hypervisor
> > > > > > > > > > > knows to fall back to
> > > > > > > > > > platform iommu or its own.
> > > > > > > > > > >
> > > > > > > > > > > > > Each virtio and non virtio devices who wants to
> > > > > > > > > > > > > report their dirty page report,
> > > > > > > > > > > > will do their way.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > > > deprecated in the future for sure, as platform
> > > > > > > > > > > > > > will provide much rich features for logging e.g
> > > > > > > > > > > > > > it can do it per PASID etc, I don't see any
> > > > > > > > > > > > > > reason virtio need to compete with the features
> > > > > > > > > > > > > > that will be provided by the platform
> > > > > > > > > > > > > Can you bring the cpu vendors and committement to
> > > > > > > > > > > > > virtio tc with timelines
> > > > > > > > > > > > so that virtio TC can omit?
> > > > > > > > > > > >
> > > > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > > > Virtio needs to be built on top of transport or
> > > > > > > > > > > > platform. There's no need to duplicate
> > > > > > > > > > their job.
> > > > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > > > >
> > > > > > > > > > > I wanted to see a strong commitment for the cpu
> > > > > > > > > > > vendors to support dirty
> > > > > > > > > > page tracking.
> > > > > > > > > >
> > > > > > > > > > The RFC of IOMMUFD support can go back to early 2022.
> > > > > > > > > > Intel, AMD and ARM are all supporting that now.
> > > > > > > > > >
> > > > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > > > >
> > > > > > > > > > Let me quote from the above link:
> > > > > > > > > >
> > > > > > > > > > """
> > > > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > > > """
> > > > > > > > > >
> > > > > > > > > > > Without such platform commitment, virtio also skipping
> > > > > > > > > > > it would not
> > > > > > work.
> > > > > > > > > >
> > > > > > > > > > Is the above sufficient? I'm a little bit more familiar
> > > > > > > > > > with vtd, the hw feature has been there for years.
> > > > > > > > > >
> > > > > > > > > Vtd has a sticky D bit that requires synchronization with
> > > > > > > > > IOPTE page caches
> > > > > > > > when sw wants to clear it.
> > > > > > > >
> > > > > > > > This is by design.
> > > > > > > >
> > > > > > > > > Do you know if is it reliable when device does multiple
> > > > > > > > > writes, ie,
> > > > > > > > >
> > > > > > > > > a. iommu write D bit
> > > > > > > > > b. software read it
> > > > > > > > > c. sw synchronize cache
> > > > > > > > > d. iommu write D bit on next write by device
> > > > > > > >
> > > > > > > > What issue did you see here? But that's not even an excuse,
> > > > > > > > if there's a bug, let's report it to IOMMU vendors and let them fix it.
> > > > > > > > The thread I point to you is actually a good space.
> > > > > > > >
> > > > > > > So we cannot claim that it is there in the platform.
> > > > > >
> > > > > > I'm confused, the thread I point to you did the cache
> > > > > > synchronization which has been explained in the changelog, so
> > > > > > what's the
> > > > issue?
> > > > > >
> > > > > If the ask is for IOMMU chip to fix something, we cannot claim
> > > > > that dirty
> > > > page tracking is available already in platform.
> > > >
> > > > Again, can you describe the issue? Why do you think the sticky part
> > > > is an issue? IOTLB needs to be sync with IO page tables, what's wrong with
> > this?
> > > Nothing wrong with it.
> > > The text is not affirmative to say it works if the sw clears it.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > Again, the point is to let the correct role play.
> > > > > > > >
> > > > > > > How many more years should we block the virtio device
> > > > > > > migration when
> > > > > > platform do not have it?
> > > > > >
> > > > > > At least for VT-D, it has been used for years.
> > > > > Is this device written pages tracked by KVM for VT-d as dirty page
> > > > > log,
> > > > instead through vfio?
> > > >
> > > > I don't get this question.
> > > You said the VT-d has dirty page tracking for years so it must be used by the
> > sw during device migration.
> >
> > It's the best way if the platform has the support for that.
> >
> > > And if that is there, how is these dirty pages of iommu are merged with the
> > cpu side?
> > > Is this done by KVM for passthrough devices for vfio?
> >
> > I don't see how it is related to the discussion here. IOMMU support is
> > sufficient as a start. If you requires CPU support, virtio is clearly the wrong
> > forum.
> You made point that VT-d dirty tracking is in use for years.
> I am asking how kernel consumed it for passthrough devices like vfio?

I've shown you the RFC in the past, and I've explained that it is
delayed because of the IOMMUFD work. What else do you want?

Where is the code supporting TDISP in the kernel? Is it supported
by any kernel now?

[...]

> > It helps the reviewers, it doesn't harm to have a summary in the changelog. Or
> > people may ask the same questions endlessly.
> >
> At least the current reviewers who discussed should stop asking endlessly. :)
>

It's just that you still haven't explained it. Is this kind of summary
somewhere in V4?

> > >
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > > So it is unusable.
> > > > > > > > > > > >
> > > > > > > > > > > > It's not about mandating, it's about doing things in
> > > > > > > > > > > > the correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > > > You should try.
> > > > > > > > > >
> > > > > > > > > > Not my duty, I just want to make sure things are done in
> > > > > > > > > > the correct layer, and once it needs to be done in the
> > > > > > > > > > virtio, there's nothing obviously
> > > > > > > > wrong.
> > > > > > > > > >
> > > > > > > > > At present, it looks all platforms are not equally ready
> > > > > > > > > for page
> > > > tracking.
> > > > > > > >
> > > > > > > > That's not an excuse to let virtio support that.
> > > > > > > It is wrong attribution as excuse.
> > > > > > >
> > > > > > > > And we need also to figure out if virtio can do that easily.
> > > > > > > > I've pointed out sufficient issues, I'm pretty sure there
> > > > > > > > would be more as the platform evolves.
> > > > > > > >
> > > > > > > I am not sure if virtio feeds the log into the platform.
> > > > > >
> > > > > > I don't understand the meaning here.
> > > > > >
> > > > > I mistakenly merged two sentences.
> > > > >
> > > > > Virtio feeds the dirty page details to the hypervisor platform
> > > > > which collects
> > > > and merges the page record.
> > > > > So it is platform choice to use iommu based tracking or device based.
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > In the current state, it is mandating.
> > > > > > > > > > > And if you think PRI is the only way,
> > > > > > > > > >
> > > > > > > > > > I don't, it's just an example where virtio can leverage
> > > > > > > > > > from either transport or platform. Or if it's the fault
> > > > > > > > > > in virtio that slows down the PRI, then it is something we can do.
> > > > > > > > > >
> > > > > > > > > Yea, it does not seem to be ready yet.
> > > > > > > > >
> > > > > > > > > > >  than you should propose that in the dirty page
> > > > > > > > > > > tracking series that you listed
> > > > > > > > > > above to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > > > >
> > > > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > > > > >
> > > > > > > > > Both the platform and virtio work is ongoing.
> > > > > > > >
> > > > > > > > Why duplicate the work then?
> > > > > > > >
> > > > > > > Not all cpu platforms support as far as I know.
> > > > > >
> > > > > > Yes, but we all know the platform is working to support this.
> > > > > >
> > > > > > Supporting this on the device is hard.
> > > > > >
> > > > > This is optional, whichever device would like to implement it, will support
> > it.
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When one does something in transport, you say,
> > > > > > > > > > > > > > > this is transport specific, do
> > > > > > > > > > > > > > some generic.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Here the device is being tracked is virtio device.
> > > > > > > > > > > > > > > PCI-SIG has told already that PCIM interface
> > > > > > > > > > > > > > > is outside the scope of
> > > > > > > > it.
> > > > > > > > > > > > > > > Hence, this is done in virtio layer here in abstract way.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > You will end up with a competition with the
> > > > > > > > > > > > > > platform/transport one that will fail.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > I don’t see a reason. There is no competition.
> > > > > > > > > > > > > Platform always have a choice to not use device
> > > > > > > > > > > > > side page tracking when it is
> > > > > > > > > > > > supported.
> > > > > > > > > > > >
> > > > > > > > > > > > Platform provides a lot of other functionalities for dirty
> > logging:
> > > > > > > > > > > > e.g per PASID, granular, etc. So you want to
> > > > > > > > > > > > duplicate them again in the virtio? If not, why choose this
> > way?
> > > > > > > > > > > >
> > > > > > > > > > > It is optional for the platforms where platform do not have it.
> > > > > > > > > >
> > > > > > > > > > We are developing new virtio functionalities that are
> > > > > > > > > > targeted for future platforms. Otherwise we would end up
> > > > > > > > > > with a feature with a very narrow use case.
> > > > > > > > > In general I agree that platform is an option too.
> > > > > > > > > Hypervisor will be able to make the decision to use
> > > > > > > > > platform when available
> > > > > > > > and fallback to device method when platform does not have it.
> > > > > > > > >
> > > > > > > > > Future and to be equally usable in near term :)
> > > > > > > >
> > > > > > > > Please don't double standard again:
> > > > > > > >
> > > > > > > > When you are talking about TDISP, you want virtio to be
> > > > > > > > designed to fit for the future where the platform is ready
> > > > > > > > in the future When you are talking about dirty tracking, you
> > > > > > > > want it to work now even if
> > > > > > > >
> > > > > > > The proposal of transport VQ is anti-TDISP.
> > > > > >
> > > > > > It's nothing about transport VQ, it's about you're saying the
> > > > > > adminq based device context. There's a comment to point out that
> > > > > > the current TDISP spec forbids modifying device state when TVM
> > > > > > is attached. Then you told us the TDISP may evolve for that.
> > > > > So? That is not double standard.
> > > > > The proposal is based on main principle that it is not depending
> > > > > on hypervisor traping + emulating which is the baseline of TDISP
> > > > >
> > > > > >
> > > > > > > The proposal of dirty tracking is not anti-platform. It is
> > > > > > > optional like rest of the
> > > > > > platform.
> > > > > > >
> > > > > > > > 1) most of the platform is ready now
> > > > > > > Can you list a ARM server CPU in production that has it? (not
> > > > > > > in some pdf
> > > > > > spec).
> > > > > >
> > > > > > Then in the context of a dirty page, I've proved you dirty page
> > > > > > tracking has been supported by all major vendors.
> > > > > Major IP vendor != major cpu chip vendor.
> > > > > I don’t agree with the proof.
> > > >
> > > > So this will be an endless debate. Did I ever ask you about ETA or
> > > > any product for TDISP?
> > > >
> > > ETA for TDISP is not relevant.
> > > You claimed for _major_ vendor support based on nonphysical cpu, hence
> > the disagreement.
> >
> > How did you define "support"?
> >
> You defined that it is supported. So, I think you deserve to define "support". :)

Support in the spec. If you don't think that can be called "support",
please explain why.

>
> > Dirty tracking has been written into the IOMMU manuals for Intel, AMD and
> > ARM for years. So you think it's not supported now? I've told you it has been
> > shipped by Intel at least, then you ask me which ARM vendor ships those
> > vIOMMU.
> >
> I wish that spec manual date = server in the cloud operator data center availability date.

Any new feature needs time to land. So does live migration. You are
applying a double standard again.

>
> > For TDISP live migration, PCI doesn't even have a draft, no? I never ask which
> > chip vendor ships the platform.
> >
>
> > You want to support dirty page tracking in virtio and keep asking when it is
> > supported by all platform vendors.
> Because you claim that all physical cpu vendors support it without enlisting who is 'all' and 'major'.

Major means Intel, AMD and ARM. Does it sound good?

>
> >
> > You want to prove your proposal can work for TDISP and TDISP migration but
> > never explain when it would be supported by at least one vendor.
> >
> Part of the spec work is done keeping s

What's the meaning of this?

> > Let's have a unified standard please.
> The standard is unified.
> The base line tenet in the proposal is not put any interface on the TDISP itself for migration that needs to be accessed by some other entity.

Can you explain why we can wait for the platform support for TDISP but
not dirty page tracking?

>
> >
> > > And that is not the reality.
> > >
> > > > >
> > > > > I already acknowledged that I have seen internal test report for
> > > > > dirty tracking
> > > > with one cpu and nic.
> > > > >
> > > > > I just don’t see all cpus have support for it.
> > > > > Hence, this optional feature.
> > > >
> > > > Repeat myself again.
> > > >
> > > > If it can be done easily and efficiently in virtio, I agree. But
> > > > I've pointed out several issues where it is not answered.
> > >
> > > I have answered most of your questions.
> > >
> > > The definition of 'easy' is very subjective.
> >
> > The reason why I don't think it is easy is because I can easily see several issues
> > that can't be solved easily.
> >
> > > At one point RSS was also not easy in some devices and IOMMU dirty page
> > tracking was also not easy.
> >
> > Yes, but we can offload the IOMMU part to the vendor. Virtio can't do
> > anything especially the part that duplicates with the function provided by the
> > transport or platform.
> And when platform does not provide, virtio device can.

The platform doesn't support TDISP now, so why don't you invent that?

Thanks
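
As an aside for readers following the WRITE_RECORD_CAP_QUERY discussion above,
here is a minimal sketch of how a driver-visible capability result and a
provisioning check could look. The structure layout, field names and the
helper are illustrative assumptions only; the spec patches under discussion
define none of this.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical result of a WRITE_RECORD_CAP_QUERY-style admin command.
     * Field names and sizes are assumptions for illustration. */
    struct write_record_cap {
        uint64_t max_tracked_range_bytes; /* total address range the device can record */
        uint32_t max_ranges;              /* number of disjoint ranges it can track */
        uint32_t max_pending_records;     /* records it can hold before the driver must read them */
    };

    /* A provisioning layer could use the reported limit to decide whether
     * device-side write recording can cover a guest of a given size. */
    static int can_track_guest(const struct write_record_cap *cap,
                               uint64_t guest_mem_bytes)
    {
        return guest_mem_bytes <= cap->max_tracked_range_bytes;
    }

    int main(void)
    {
        struct write_record_cap cap = {
            .max_tracked_range_bytes = 2ULL << 40,   /* 2 TiB */
            .max_ranges = 16,
            .max_pending_records = 1u << 20,
        };

        printf("1 TiB guest trackable: %d\n",
               can_track_guest(&cap, 1ULL << 40));
        return 0;
    }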





^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  4:28                                                                   ` Parav Pandit
@ 2023-11-24  3:08                                                                     ` Jason Wang
  0 siblings, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-24  3:08 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan, virtio-comment, cohuck,
	sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas, eperezma,
	Siwei Liu

On Wed, Nov 22, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 9:50 AM
> >
> > On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 12:25 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 7:31 PM
> > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 6:02 PM
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > > > > > > >
> > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav
> > > > > > > > > > > > > > Pandit
> > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800,
> > > > > > > > > > > > > > > > > Zhu, Lingshan
> > > > > > > > wrote:
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM
> > > > > > > > > > > > > > > > >>> +0000, Parav Pandit
> > > > > > > > wrote:
> > > > > > > > > > > > > > > > >>>> We should expose a limit of the device
> > > > > > > > > > > > > > > > >>>> in the proposed
> > > > > > > > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how
> > > > > > > > > > > > > > > > much
> > > > > > range
> > > > > > > > > > > > > > > > it can
> > > > > > > > > > > > track.
> > > > > > > > > > > > > > > > >>>> So that future provisioning framework can use
> > it.
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > > > > > > > > >>> I do worry about how this can even work
> > though.
> > > > > > > > > > > > > > > > >>> If you want a generic device you do not
> > > > > > > > > > > > > > > > >>> get to dictate how much memory VM
> > > > > > > > > > > > has.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> Aren't we talking bit per page? With
> > > > > > > > > > > > > > > > >>> 1TByte of memory to track
> > > > > > > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> And you happily say "we'll address this in the
> > future"
> > > > > > > > > > > > > > > > >>> while at the same time fighting tooth
> > > > > > > > > > > > > > > > >>> and nail against adding single bit
> > > > > > > > > > > > > > > > >>> status registers because
> > > > > > scalability?
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> I have a feeling doing this completely
> > > > > > > > > > > > > > > > >>> theoretical like this is
> > > > > > > > > > > > problematic.
> > > > > > > > > > > > > > > > >>> Maybe you have it all laid out neatly in
> > > > > > > > > > > > > > > > >>> your head but I suspect not all of TC
> > > > > > > > > > > > > > > > >>> can picture it clearly enough based just
> > > > > > > > > > > > > > > > >>> on spec
> > > > > > > > > > > > text.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> We do sometimes ask for POC
> > > > > > > > > > > > > > > > >>> implementation in linux / qemu to
> > > > > > > > > > > > > > > > >>> demonstrate how things work before
> > > > > > > > > > > > > > > > >>> merging
> > > > > > > > code.
> > > > > > > > > > > > > > > > >>> We skipped this for admin things so far
> > > > > > > > > > > > > > > > >>> but I think it's a good idea to start doing it here.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> What makes me pause a bit before saying
> > > > > > > > > > > > > > > > >>> please do a PoC is all the opposition
> > > > > > > > > > > > > > > > >>> that seems to exist to even using admin
> > > > > > > > > > > > > > > > >>> commands in the 1st place. I think once
> > > > > > > > > > > > > > > > >>> we finally stop arguing about whether to
> > > > > > > > > > > > > > > > >>> use admin commands at all then a PoC
> > > > > > > > > > > > > > > > >>> will be needed
> > > > > > > > > > > > > > before merging.
> > > > > > > > > > > > > > > > >> We have POR productions that implemented
> > > > > > > > > > > > > > > > >> the approach in my
> > > > > > > > > > > > series.
> > > > > > > > > > > > > > > > >> They are multiple generations of
> > > > > > > > > > > > > > > > >> productions in market and running in
> > > > > > > > > > > > > > > > >> customers data centers for
> > > > years.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Back to 2019 when we start working on
> > > > > > > > > > > > > > > > >> vDPA, we have sent some samples of
> > > > > > > > > > > > > > > > >> production(e.g., Cascade
> > > > > > > > > > > > > > > > >> Glacier) and the datasheet, you can find
> > > > > > > > > > > > > > > > >> live migration facilities there, includes
> > > > > > > > > > > > > > > > >> suspend, vq state and other
> > > > > > > > features.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> And there is an reference in DPDK live
> > > > > > > > > > > > > > > > >> migration, I have provided this page
> > > > > > > > > > > > > > > > >> before:
> > > > > > > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadev
> > > > > > > > > > > > > > > > >> s/if c.ht ml, it has been working for
> > > > > > > > > > > > > > > > >> long long time.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> So if we let the facts speak, if we want
> > > > > > > > > > > > > > > > >> to see if the proposal is proven to work,
> > > > > > > > > > > > > > > > >> I would
> > > > > > > > > > > > > > > > >> say: They are POR for years, customers
> > > > > > > > > > > > > > > > >> already deployed them for
> > > > > > > > > > > > years.
> > > > > > > > > > > > > > > > > And I guess what you are trying to say is
> > > > > > > > > > > > > > > > > that this patchset we are reviewing here
> > > > > > > > > > > > > > > > > should be help to the same standard and
> > > > > > > > > > > > > > > > > there should be a PoC? Sounds
> > > > > > reasonable.
> > > > > > > > > > > > > > > > Yes and the in-marketing productions are
> > > > > > > > > > > > > > > > POR, the series just improves the design,
> > > > > > > > > > > > > > > > for example, our series also use registers
> > > > > > > > > > > > > > > > to track vq state, but improvements than CG
> > > > > > > > > > > > > > > > or BSC. So I think they are proven
> > > > > > > > > > > > > > to work.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If you prefer to go the route of POR and
> > > > > > > > > > > > > > > production and proven documents
> > > > > > > > > > > > > > etc, there is ton of it of multiple types of
> > > > > > > > > > > > > > products I can dump here with open- source code
> > > > > > > > > > > > > > and documentation and
> > > > > > more.
> > > > > > > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Michael has requested some performance
> > > > > > > > > > > > > > > comparisons, not all are ready to
> > > > > > > > > > > > > > share yet.
> > > > > > > > > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And all the vdpa dpdk you published does not
> > > > > > > > > > > > > > > have basic CVQ support when I
> > > > > > > > > > > > > > last looked at it.
> > > > > > > > > > > > > > > Do you know when was it added?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > > > > > > > The problem with CVQ generally, is that VDPA
> > > > > > > > > > > > > > wants to shadow CVQ it at all times because it
> > > > > > > > > > > > > > wants to decode and cache the content. But this
> > > > > > > > > > > > > > problem has nothing to do with dirty tracking
> > > > > > > > > > > > > > even though it also
> > > > > > > > > > > > mentions "shadow":
> > > > > > > > > > > > > > if device can report it's state then there's no
> > > > > > > > > > > > > > need to shadow
> > > > > > CVQ.
> > > > > > > > > > > > >
> > > > > > > > > > > > > For the performance numbers with the pre-copy and
> > > > > > > > > > > > > device context of
> > > > > > > > > > > > patches posted 1 to 5, the downtime reduction of the
> > > > > > > > > > > > VM is 3.71x with active traffic on 8 RQs at 100Gbps port
> > speed.
> > > > > > > > > > > >
> > > > > > > > > > > > Sounds good can you please post a bit more detail?
> > > > > > > > > > > > which configs are you comparing what was the result
> > > > > > > > > > > > on each of
> > > > > > them.
> > > > > > > > > > >
> > > > > > > > > > > Common config: 8+8 tx and rx queues.
> > > > > > > > > > > Port speed: 100Gbps
> > > > > > > > > > > QEMU 8.1
> > > > > > > > > > > Libvirt 7.0
> > > > > > > > > > > GVM: Centos 7.4
> > > > > > > > > > > Device: virtio VF hardware device
> > > > > > > > > > >
> > > > > > > > > > > Config_1: virtio suspend/resume similar to what
> > > > > > > > > > > Lingshan has, largely vdpa stack
> > > > > > > > > > > Config_2: Device context method of admin commands
> > > > > > > > > >
> > > > > > > > > > OK that sounds good. The weird thing here is that you
> > > > > > > > > > measure
> > > > > > "downtime".
> > > > > > > > > > What exactly do you mean here?
> > > > > > > > > > I am guessing it's the time to retrieve on source and
> > > > > > > > > > re-program device state on destination? And this is
> > > > > > > > > > 3.71x out of
> > > > how long?
> > > > > > > > > Yes. Downtime is the time during which the VM is not
> > > > > > > > > responding or receiving
> > > > > > > > packets, which involves reprogramming the device.
> > > > > > > > > 3.71x is relative time for this discussion.
> > > > > > > >
> > > > > > > > Oh interesting. So VM state movement including reprogramming
> > > > > > > > the CPU is dominated by reprogramming this single NIC, by a
> > > > > > > > factor of
> > > > almost 4?
> > > > > > > Yes.
> > > > > >
> > > > > > Could you post some numbers too then?  I want to know whether
> > > > > > that would imply that VM boot is slowed down significantly too.
> > > > > > If yes that's another motivation for pci transport 2.0.
> > > > > It was 1.8 sec down to 480msec.
> > > >
> > > > Well, there's work ongoing to reduce the downtime of the shadow
> > virtqueue.
> > > >
> > > > Eugenio or Si-wei may share an exact number, but it should be
> > > > several hundreds of ms.
> > > >
> > > Shadow vq is not applicable at all as comparison point because there is no
> > virtio specific qemu etc software involved here.
> >
> > I don't get the point.
> >
> > Shadow virtqueue is virtio specific for sure and the core logic is decoupled of
> > the vDPA logic. If not, it's bug and we need to fix.
> >
> The base requirement is that the software is not mediating any virtio interfaces (config, cvq, data vqs).

I think we agree that any proposal should work in both passthrough and
non-passthrough. No?

Otherwise we circle back.

Thanks
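
As a consistency check on the downtime figures quoted above (a "3.71x"
relative reduction, and "1.8 sec down to 480msec"):

    1.8 s / 3.71 ≈ 0.485 s ≈ 480 ms

so the ratio and the absolute before/after numbers agree with each other,
modulo rounding.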




^ permalink raw reply	[flat|nested] 157+ messages in thread

* [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  4:19                                                       ` [virtio-comment] " Parav Pandit
@ 2023-11-24  3:09                                                         ` Jason Wang
  0 siblings, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-24  3:09 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 22, 2023 at 12:19 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 9:45 AM
> >
> > On Wed, Nov 22, 2023 at 12:26 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 9:55 AM
> > > >
> > > > On Fri, Nov 17, 2023 at 11:02 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 11:51 PM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 05:29:49PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Thursday, November 16, 2023 10:56 PM
> > > > > > > >
> > > > > > > > On Thu, Nov 16, 2023 at 04:26:53PM +0000, Parav Pandit wrote:
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Thursday, November 16, 2023 5:18 PM
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 16, 2023 at 07:40:57AM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > Sent: Thursday, November 16, 2023 1:06 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Nov 16, 2023 at 12:51:40AM -0500, Michael S.
> > > > > > > > > > > > Tsirkin
> > > > wrote:
> > > > > > > > > > > > > On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
> > > > > > > > > > > > > Pandit
> > > > wrote:
> > > > > > > > > > > > > > We should expose a limit of the device in the
> > > > > > > > > > > > > > proposed
> > > > > > > > > > > > WRITE_RECORD_CAP_QUERY command, that how much
> > range
> > > > > > > > > > > > it can
> > > > > > > > track.
> > > > > > > > > > > > > > So that future provisioning framework can use it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I will cover this in v5 early next week.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I do worry about how this can even work though. If
> > > > > > > > > > > > > you want a generic device you do not get to
> > > > > > > > > > > > > dictate how much memory
> > > > > > VM has.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Aren't we talking bit per page? With 1TByte of
> > > > > > > > > > > > > memory to track
> > > > > > > > > > > > > -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > >
> > > > > > > > > > > > Ugh. Actually of course:
> > > > > > > > > > > > With 1TByte of memory to track -> 256Mbit -> 32Mbit
> > > > > > > > > > > > -> 8Mbyte per VF
> > > > > > > > > > > >
> > > > > > > > > > > > 8Gbyte per *PF* with 1K VFs.
> > > > > > > > > > > >
> > > > > > > > > > > Device may not maintain as a bitmap.
> > > > > > > > > >
> > > > > > > > > > However you maintain it, there's 256Mega bit of information.
> > > > > > > > > There may be other data structures that device may deploy
> > > > > > > > > as for example
> > > > > > > > hash or tree or something else.
> > > > > > > >
> > > > > > > > Point being?
> > > > > > > The device may have some hashing accelerator or other
> > > > > > > improvements that
> > > > > > may perform better than bitmap as many queues in parallel
> > > > > > attempt to update the shared database.
> > > > > >
> > > > > > Maybe, I didn't give this thought.
> > > > > >
> > > > > > My point was that to be able to keep all combinations of
> > > > > > dirty/non dirty page for each 4k page in a 1TByte guest device
> > > > > > needs 8MBytes of on-device memory per VF. As designed the query
> > > > > > also has to report it for each VF accurately even if multiple VFs are
> > accessing same guest.
> > > > > Yes.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > And this is runtime memory only during the short live
> > > > > > > > > migration period of
> > > > > > > > 400msec or less.
> > > > > > > > > It is not some _always_ resident memory.
> > > >
> > > > When developing the spec, we should not have any assumption for the
> > > > implementation. For example, you can't just assume virtio is always
> > > > emulated in the software in the DPU.
> > > >
> > > There is no such assumption.
> > > It is supported on non DPU devices too.
> >
> > You meant e.g a 8MB on-chip resource per VF is good to go?
> >
> It is the device implementation detail. Maybe it uses 8MB, may be not.

So you leave a Pandora's box for the vendor to open and see? You were
debating 16 bits of on-chip resources, but now you're saying 8MB may or
may not work?

> And if you are going to compare again with slow registers memory, it is not apple to apple comparison anyway.

I never said using a register is a better way. A register is a
transport-specific concept; you can't say it works for all transports.

>
> Non DPU device may have such memory for data path acceleration.
>
> > >
> > > > How can you make sure you can converge in 400ms without having a
> > > > interface for the driver to set the correct parameter like dirty rates?
> > >
> > > 400msec is also written anywhere as requirement if this is what you want to
> > argue about.
> >
> > No, the downtime needs to coordinate with the hypervisor, that is what I
> > want to say. Unfortunately, I don't see any interface in this series.
> >
> What do you mean by coordinated?

I've given the equation that QEMU uses to calculate the downtime. It
needs knowledge beyond virtio, no?

The device needs to be throttled so that it stays within the
hypervisor's expectations, otherwise the downtime target can't be met.

> This series has mechanism to eliminate the downtime on src and dst side during device migration during pre-copy phase.

I don't see how it can "eliminate" the downtime.

>
> > > There is nothing prevents to extend the interface to define the SLA as
> > additional commands in the future to improve the solution.
> > >
> > > There is no need to boil the ocean now. Once the base infrastructure is
> > built, we will improve it further.
> > > And proposed patches are reasonably well covered to our knowledge.
> >
> > Well, it is not me but you that claims it can be done in 400ms. I'm wondering
> > how and you told me it could be done in the future?
> >
> In our tests it is near to this number.

You need to explain why it can, especially under heavy load.

> The discussion is about programming the SLA and that can be an extension.

Migration naturally has an SLA, namely the downtime; you can't just
ignore it at the start of the design.

Thanks
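
To put rough numbers on the per-VF memory being argued about above, here is a
quick sizing helper. It assumes a plain bitmap with one bit per tracked unit
of guest memory; the thread makes clear a device is free to use other data
structures, and the per-VF figure depends entirely on the tracking granularity
chosen, so treat this as a model, not the device design.

    #include <stdint.h>
    #include <stdio.h>

    /* Bytes of bitmap needed to track writes over mem_bytes of guest memory
     * at granule_bytes granularity, assuming one bit per tracked unit. */
    static uint64_t bitmap_bytes(uint64_t mem_bytes, uint64_t granule_bytes)
    {
        uint64_t units = (mem_bytes + granule_bytes - 1) / granule_bytes;
        return (units + 7) / 8;
    }

    int main(void)
    {
        uint64_t one_tib = 1ULL << 40;

        /* 1 TiB tracked at 4 KiB pages needs 32 MiB of bitmap per VF;
         * coarser granularity shrinks it proportionally. */
        printf("4 KiB granule:  %llu MiB\n",
               (unsigned long long)(bitmap_bytes(one_tib, 4096) >> 20));
        printf("16 KiB granule: %llu MiB\n",
               (unsigned long long)(bitmap_bytes(one_tib, 16384) >> 20));
        return 0;
    }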




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-22  4:34                                                                               ` Parav Pandit
@ 2023-11-24  3:15                                                                                 ` Jason Wang
  0 siblings, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-24  3:15 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, virtio-comment, cohuck, sburla,
	Shahaf Shuler, Maor Gottlieb, Yishai Hadas, lingshan.zhu

On Wed, Nov 22, 2023 at 12:35 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 9:48 AM
> >
> > On Wed, Nov 22, 2023 at 12:29 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 10:47 AM
> > > >
> > > > On Fri, Nov 17, 2023 at 8:51 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Michael S.
> > > > > > Tsirkin
> > > > > > Sent: Friday, November 17, 2023 6:11 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 12:22:59PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 5:03 PM
> > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > >
> > > > > > > > > Somehow the claim of shadow vq is great without sharing
> > > > > > > > > any performance
> > > > > > > > numbers is what I don't agree with.
> > > > > > > >
> > > > > > > > It's upstream in QEMU. Test it youself.
> > > > > > > >
> > > > > > > We did few minutes back.
> > > > > > > It results in a call trace.
> > > > > > > Vhost_vdpa_setup_vq_irq crashes on list corruption on net-next.
> > > > > >
> > > > > > Wrong list for this bug report.
> > > > > >
> > > > > > > We are stopping any shadow vq tests on unstable stuff.
> > > > > >
> > > > > > If you don't want to benchmark against alternatives how are you
> > > > > > going to prove your stuff is worth everyone's time?
> > > > >
> > > > > Comparing performance of the functional things count.
> > > > > You suggest shadow vq, frankly you should post the grand numbers
> > > > > of
> > > > shadow vq.
> > > >
> > > > We need an apple to apple comparison. Otherwise you may argue with
> > > > that, no?
> > > >
> > > When the requirements are met the comparison can be made of the
> > solution.
> > > And I don’t see that the basic requirements are matching for two different
> > use cases.
> > > So no point in discussing one OS specific implementation as reference
> > point.
> >
> > Shadow virtqueue is not OS specific, it's a common method. If you disagree,
> > please explain why.
> >
> As you claim shadow virtqueue is generic not dependent on OS, how does I benchmark on QNX today?

You know QEMU is portable, right? How did you benchmark QNX with your proposal?

>
> > > Otherwise I will end up adding vfio link in the commit log in next version as
> > you are asking similar things here and being non neutral to your ask.
> >
> > When doing a benchmark, you need to describe your setups, no? So any
> > benchmark is setup specific, nothing wrong.
> >
> > It looks to me you claim your method is better, but refuse to give proofs.
> >
> I gave details to Michael in the email. Please refer.
>
> > >
> > > Anyway, please bring the perf data whichever you want to compare in
> > another forum. It is not the criteria anyway.
> >
> > So how can you prove your method is the best one? You have posted the
> > series for months, and so far I still don't see any rationale about why you
> > choose to go that way.
> It is explained in theory of operation.
> You refuse to read it.

Can you tell me which part of your v4 explains why you chose to go this way?

"""
+During the device migration flow, a passthrough device may write data to the
+guest virtual machine's memory, a source hypervisor needs to keep track of
+these written memory to migrate such memory to destination hypervisor.
+Some systems may not be able to keep track of such memory write addresses at
+hypervisor level. In such a scenario, a device records and reports these
+written memory addresses to the owner device. The owner driver enables write
+recording for one or more physical address ranges per device during device
+migration flow. The owner driver periodically queries these written physical
+address records from the device. As the driver reads the written address
+records, the device clears those records from the device.
+Once the device reports zero or small number of written address records, the
+device mode is set to \field{Stop} or \field{Freeze}. Once the device is set
+to \field{Stop} or \field{Freeze} mode, and once all the IOVA records are
+read, the driver stops the write recording in the device.
"""

>
> >
> > This is very odd as we've gone through several methods one or two years ago
> > when discussing vDPA live migration.
> >
> It does not matter as this is not vdpa forum.
>
> > >
> > > > >
> > > > > It is really not my role to report bug of unstable stuff and
> > > > > compare the perf
> > > > against.
> > > >
> > > > Qemu/KVM is highly relevant here no? And it's the way to develop the
> > > > community. The shadow vq code is handy.
> > > It is relevant for direct mapped device.
> >
> > Let's focus on the function then discuss the use cases. If you can't prove your
> > proposal has a proper function, what's the point of discussing the use cases?
> >
> The proper function is described.
> You choose to not accept in favour of considering on the vdpa.

Is the discussion here at all relevant to vDPA? I'm telling you to use
the platform IOMMU; how is that related to vDPA?

>
> > > There is absolutely no point of converting virtio device to another
> > virtualization layer and run again and get another virtio device.
> > > So for direct mapping use case shadow vq is not relevant.
> >
> > It is needed because shadow virtqueue is the baseline. Most of the issues
> > don't exist in the case of shadow virtqueue.
> >
> I disagree.
> For direct mapping there is no virtio specific OS layer involved.

Again, you need to prove that direct mapping can work.

And I've pointed out that "direct mapping" of the vq for dirty page
tracking can work with "direct mapping" of the vIOMMU.

> Hence shadow vq specific implementation is not appliable.
>
> > We don't want to end up with a solution that
> >
> > 1) can't outperform shadow virtqueue
> Disagree. There is no shadow vq in direct mapping. No comparison.
>
> > 2) have more issues than shadow virtqueue
> >
> There are none.

I think I've pointed out sufficient issues; if you choose to ignore
them, the discussion is going nowhere.

>
> > > For other use cases, please continue.
> > >
> > > >
> > > > Just an email to Qemu should be fine, we're not asking you to fix the bug.
> > > >
> > > > Btw, how do you define stable? E.g do you think the Linus tree is stable?
> > > >
> > > Basic test with iperf is not working. Crashing it.
> >
> > As a kernel developer, dealing with crashing at any layer is pretty common.
> > No?
> >
> So, kernel developers do not ask to compare the crashing code.

Report it, let the community fix it, then compare again.

This is how the community works.

Thanks




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-24  2:29                                                                     ` Jason Wang
@ 2023-11-28  3:00                                                                       ` Si-Wei Liu
  2023-11-29  5:12                                                                         ` Jason Wang
  0 siblings, 1 reply; 157+ messages in thread
From: Si-Wei Liu @ 2023-11-28  3:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Michael S. Tsirkin, Zhu, Lingshan, virtio-comment,
	cohuck, sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas,
	eperezma



On 11/23/2023 6:29 PM, Jason Wang wrote:
> On Thu, Nov 23, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 11/21/2023 9:31 PM, Jason Wang wrote:
>>> On Wed, Nov 22, 2023 at 10:31 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>>> (dropping my personal email abandoned for upstream discussion for now,
>>>> please try to copy my corporate email address for more timely response)
>>>>
>>>> On 11/20/2023 10:55 PM, Jason Wang wrote:
>>>>> On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Sent: Friday, November 17, 2023 7:31 PM
>>>>>>> To: Parav Pandit <parav@nvidia.com>
>>>>>>>
>>>>>>> On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>> Sent: Friday, November 17, 2023 6:02 PM
>>>>>>>>>
>>>>>>>>> On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>> Sent: Friday, November 17, 2023 5:35 PM
>>>>>>>>>>> To: Parav Pandit <parav@nvidia.com>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
>>>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>>> Sent: Friday, November 17, 2023 5:04 PM
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
>>>>>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>>>>>> Sent: Friday, November 17, 2023 4:30 PM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
>>>>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>>>> Sent: Friday, November 17, 2023 3:30 PM
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
>>>>>>>>>>>>>>>>>> Lingshan
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
>>>>>>>>>>>>>>>>>>>> Pandit
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> We should expose a limit of the device in the
>>>>>>>>>>>>>>>>>>>>> proposed
>>>>>>>>>>>>>>>>> WRITE_RECORD_CAP_QUERY command, that how much
>>>>>>> range
>>>>>>>>>>>>>>>>> it can
>>>>>>>>>>>>> track.
>>>>>>>>>>>>>>>>>>>>> So that future provisioning framework can use it.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I will cover this in v5 early next week.
>>>>>>>>>>>>>>>>>>>> I do worry about how this can even work though.
>>>>>>>>>>>>>>>>>>>> If you want a generic device you do not get to
>>>>>>>>>>>>>>>>>>>> dictate how much memory VM
>>>>>>>>>>>>> has.
>>>>>>>>>>>>>>>>>>>> Aren't we talking bit per page? With 1TByte of
>>>>>>>>>>>>>>>>>>>> memory to track
>>>>>>>>>>>>>>>>>>>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> And you happily say "we'll address this in the future"
>>>>>>>>>>>>>>>>>>>> while at the same time fighting tooth and nail
>>>>>>>>>>>>>>>>>>>> against adding single bit status registers because
>>>>>>> scalability?
>>>>>>>>>>>>>>>>>>>> I have a feeling doing this completely
>>>>>>>>>>>>>>>>>>>> theoretical like this is
>>>>>>>>>>>>> problematic.
>>>>>>>>>>>>>>>>>>>> Maybe you have it all laid out neatly in your
>>>>>>>>>>>>>>>>>>>> head but I suspect not all of TC can picture it
>>>>>>>>>>>>>>>>>>>> clearly enough based just on spec
>>>>>>>>>>>>> text.
>>>>>>>>>>>>>>>>>>>> We do sometimes ask for POC implementation in
>>>>>>>>>>>>>>>>>>>> linux / qemu to demonstrate how things work
>>>>>>>>>>>>>>>>>>>> before merging
>>>>>>>>> code.
>>>>>>>>>>>>>>>>>>>> We skipped this for admin things so far but I
>>>>>>>>>>>>>>>>>>>> think it's a good idea to start doing it here.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> What makes me pause a bit before saying please
>>>>>>>>>>>>>>>>>>>> do a PoC is all the opposition that seems to
>>>>>>>>>>>>>>>>>>>> exist to even using admin commands in the 1st
>>>>>>>>>>>>>>>>>>>> place. I think once we finally stop arguing
>>>>>>>>>>>>>>>>>>>> about whether to use admin commands at all then
>>>>>>>>>>>>>>>>>>>> a PoC will be needed
>>>>>>>>>>>>>>> before merging.
>>>>>>>>>>>>>>>>>>> We have POR productions that implemented the
>>>>>>>>>>>>>>>>>>> approach in my
>>>>>>>>>>>>> series.
>>>>>>>>>>>>>>>>>>> They are multiple generations of productions in
>>>>>>>>>>>>>>>>>>> market and running in customers data centers for years.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Back to 2019 when we start working on vDPA, we
>>>>>>>>>>>>>>>>>>> have sent some samples of production(e.g.,
>>>>>>>>>>>>>>>>>>> Cascade
>>>>>>>>>>>>>>>>>>> Glacier) and the datasheet, you can find live
>>>>>>>>>>>>>>>>>>> migration facilities there, includes suspend, vq
>>>>>>>>>>>>>>>>>>> state and other
>>>>>>>>> features.
>>>>>>>>>>>>>>>>>>> And there is an reference in DPDK live migration,
>>>>>>>>>>>>>>>>>>> I have provided this page
>>>>>>>>>>>>>>>>>>> before:
>>>>>>>>>>>>>>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.ht
>>>>>>>>>>>>>>>>>>> ml, it has been working for long long time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So if we let the facts speak, if we want to see
>>>>>>>>>>>>>>>>>>> if the proposal is proven to work, I would
>>>>>>>>>>>>>>>>>>> say: They are POR for years, customers already
>>>>>>>>>>>>>>>>>>> deployed them for
>>>>>>>>>>>>> years.
>>>>>>>>>>>>>>>>>> And I guess what you are trying to say is that
>>>>>>>>>>>>>>>>>> this patchset we are reviewing here should be help
>>>>>>>>>>>>>>>>>> to the same standard and there should be a PoC? Sounds
>>>>>>> reasonable.
>>>>>>>>>>>>>>>>> Yes and the in-marketing productions are POR, the
>>>>>>>>>>>>>>>>> series just improves the design, for example, our
>>>>>>>>>>>>>>>>> series also use registers to track vq state, but
>>>>>>>>>>>>>>>>> improvements than CG or BSC. So I think they are
>>>>>>>>>>>>>>>>> proven
>>>>>>>>>>>>>>> to work.
>>>>>>>>>>>>>>>> If you prefer to go the route of POR and production
>>>>>>>>>>>>>>>> and proven documents
>>>>>>>>>>>>>>> etc, there is ton of it of multiple types of products I
>>>>>>>>>>>>>>> can dump here with open- source code and documentation and
>>>>>>> more.
>>>>>>>>>>>>>>>> Let me know what you would like to see.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Michael has requested some performance comparisons,
>>>>>>>>>>>>>>>> not all are ready to
>>>>>>>>>>>>>>> share yet.
>>>>>>>>>>>>>>>> Some are present that I will share in coming weeks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And all the vdpa dpdk you published does not have
>>>>>>>>>>>>>>>> basic CVQ support when I
>>>>>>>>>>>>>>> last looked at it.
>>>>>>>>>>>>>>>> Do you know when was it added?
>>>>>>>>>>>>>>> It's good enough for PoC I think, CVQ or not.
>>>>>>>>>>>>>>> The problem with CVQ generally, is that VDPA wants to
>>>>>>>>>>>>>>> shadow CVQ it at all times because it wants to decode
>>>>>>>>>>>>>>> and cache the content. But this problem has nothing to
>>>>>>>>>>>>>>> do with dirty tracking even though it also
>>>>>>>>>>>>> mentions "shadow":
>>>>>>>>>>>>>>> if device can report it's state then there's no need to shadow
>>>>>>> CVQ.
>>>>>>>>>>>>>> For the performance numbers with the pre-copy and device
>>>>>>>>>>>>>> context of
>>>>>>>>>>>>> patches posted 1 to 5, the downtime reduction of the VM is
>>>>>>>>>>>>> 3.71x with active traffic on 8 RQs at 100Gbps port speed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sounds good can you please post a bit more detail?
>>>>>>>>>>>>> which configs are you comparing what was the result on each of
>>>>>>> them.
>>>>>>>>>>>> Common config: 8+8 tx and rx queues.
>>>>>>>>>>>> Port speed: 100Gbps
>>>>>>>>>>>> QEMU 8.1
>>>>>>>>>>>> Libvirt 7.0
>>>>>>>>>>>> GVM: Centos 7.4
>>>>>>>>>>>> Device: virtio VF hardware device
>>>>>>>>>>>>
>>>>>>>>>>>> Config_1: virtio suspend/resume similar to what Lingshan has,
>>>>>>>>>>>> largely vdpa stack
>>>>>>>>>>>> Config_2: Device context method of admin commands
>>>>>>>>>>> OK that sounds good. The weird thing here is that you measure
>>>>>>> "downtime".
>>>>>>>>>>> What exactly do you mean here?
>>>>>>>>>>> I am guessing it's the time to retrieve on source and re-program
>>>>>>>>>>> device state on destination? And this is 3.71x out of how long?
>>>>>>>>>> Yes. Downtime is the time during which the VM is not responding or
>>>>>>>>>> receiving
>>>>>>>>> packets, which involves reprogramming the device.
>>>>>>>>>> 3.71x is relative time for this discussion.
>>>>>>>>> Oh interesting. So VM state movement including reprogramming the CPU
>>>>>>>>> is dominated by reprogramming this single NIC, by a factor of almost 4?
>>>>>>>> Yes.
>>>>>>> Could you post some numbers too then?  I want to know whether that would
>>>>>>> imply that VM boot is slowed down significantly too. If yes that's another
>>>>>>> motivation for pci transport 2.0.
>>>>>> It was 1.8 sec down to 480msec.
>>>>> Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
>>>>>
>>>>> Eugenio or Si-wei may share an exact number, but it should be several
>>>>> hundreds of ms.
>>>> That was mostly for device teardown time at the the source but there's
>>>> also setup cost at the destination that needs to be counted.
>>>> Several hundred of milliseconds would be the ultimate goal I would say
>>>> (right now the numbers from Parav more or less reflects the status quo
>>>> but there's ongoing work to make it further down), and I don't doubt
>>>> several hundreds of ms is possible. But to be fair, on the other hand,
>>>> shadow vq on real vdpa hardware device would need a lot of dedicated
>>>> optimization work across all layers (including hardware or firmware) all
>>>> over the places to achieve what a simple suspend-resume (save/load)
>>>> interface can easily do with VFIO migration.
>>> That's fine. Just to clairfy, shadow virtqueue here doesn't mean it
>>> can't save/load. We want to see how it is useful for dirty page
>>> tracking since tracking dirty pages by device itself seems problematic
>>> at least from my point of view.
>> TBH I don't see how this comparison can help prove the problematic part
>> of device dirty tracking, or if it has anything to do with.
> Shadow virtqueue is not used to prove the problem, the problem could
> be uncovered during the review.
>
> The shadow virtuqueue is used to give us a bottom line. If a huge
> effort were done for spec but it can't perform better than virtqueue,
> the effort became meaningless.
Got it. Thanks for the detailed clarifications, Jason. So it's not
device-assisted dirty tracking itself you find issue with, but the flaws
and inefficiencies in the current proposal as pointed out in previous
discussions? In other words: if a certain device-assisted tracking scheme
is proved to be helpful, or to perform better than the alternatives (be
it shadow vq or platform IOMMU tracking), and is backed by real
performance data for a few scenarios or for the most commonly used
setups, is it then acceptable to you even if the same device tracking
mechanism doesn't support, or doesn't have reasonably good value for,
other scenarios (e.g. PASID, ATS, vIOMMU etc. as you listed below)?

It's up to the author to further improve the current spec proposal, but
if device-assisted tracking in general is considered problematic and
prohibited even when proved to perform best for some (but not ALL) use
cases, I will be very surprised and would like to know why, as it is
just an optional device feature aiming to be self-contained in virtio
itself, without having to depend on vendor-specific optimizations (like
the vdpa shadow vq).

Thanks
-Siwei
>
>> In many
>> cases vDPA and hardware virtio are for different deployment scenarios
>> with varied target users, I don't see how vDPA can completely substitute
>> hardware virtio for many reasons regardless shadow virtqueue wins or not.
> It's not about whether vDPA can win or not. It's about a quick
> demonstration about how shadow virtqueue can perform.  From the view
> of the shadow virtqueue, it doesn't know whether the underlayer is
> vDPA or virtio. It's not hard to imagine, the downtime we get from
> vDPA is the bottom line of downtime via virtio since virtio is much
> more easier.
>
>> If anything relevant I would more like to see performance comparison
>> with platform dirty tracking via IOMMUFD, but that's perhaps too early
>> stage at this point to conclude anything given there's very limited
>> availability (in terms of supporting software, I know some supporting
>> hardware has been around for a few years) and none of the potential
>> software optimizations is in place at this point to make a fair
>> comparison for.
> We need to make sure the correctness of the function before we can
> talk about optimizations. And I don't see how this proposal is
> optimized for many ways.
>
>> Granted device assisted tracking has its own set of
>> limitations e.g. loose coupling or integration with platform features,
>> lack of nested and PASID support et al. However, state of the art for
>> platform dirty tracking is not perfect either, far off being highly
>> optimized for all types of workload or scenarios. At least to me the
>> cost of page table walk to scan all PTEs across all levels is not easily
>> negligible - given no PML equivalent here, are we sure the whole range
>> scan can be as efficient and scalable as memory size / # of PTEs grows?
> If you see the discussion, this proposal requires scan PTEs as well in
> many ways.
>
>> How large it may impact the downtime with this rudimentary dirty scan?
>> No data point was given thus far. If chances are that there could be
>> major improvement from device tracking for those general use cases to
>> supplement what platform cannot achieve efficiently enough, it's not too
>> good to kill off the possibility entirely at this early stage. Maybe a
>> PoC or some comparative performance data can help prove the theory?
> We can ask in the thread of IOMMUFD dirty tracking patches.
>
>> On the other hand, the device assisted tracking has at least one
>> advantage that platform cannot simply offer - throttle down device for
>> convergence, inherently or explicitly whenever needed.
> Please refer the past discussion, I can see how it is throttled in the
> case of PML similar mechanism. But I can't see how it can be done
> here. This proposal requires the device to reserver sufficient
> resources where the throttle is implementation specific where the
> hypervisor can't depend on. It needs API to set dirty page rates at
> least.
>
>> I think earlier
>> Micheal suggested something to make the core data structure used for
>> logging more efficient and compact, working like PML but using a queue
>> or an array, and the entry of which may contain a list of discrete pages
>> or contiguous PFN ranges.
> PML solve the resources problem but not other problem:
>
> 1) Throttling: it's still not something that hypervisor can depend.
> The reason why PML in CPU work is that hypervisor can throttle the KVM
> process so it can slow down to the expected dirty rates.
> 2) Platform specific issue: PASID, ATS, translation failures, reserved
> regions, and a lot of other stuffs
> 3) vIOMMU issue: horrible delay in IOTLB invalidation path
> 4) Doesn't work in the case of vIOMMU offloading
>
> And compare the the existing approach, it ends up with more PCI
> transactions under heavy load.
>
>> On top of this one may add parallelism to
>> distribute load to multiple queues, or add zero copy to speed up dirty
>> sync to userspace - things virtio queues are pretty good at doing. After
>> all, nothing can be perfect to begin with, and every complex feature
>> would need substantial time to improve and evolve.
> Evolve is good, but the problem is platform is also evolving. The
> function is duplicated there and platform provides a lot of advanced
> features that can co-operate with dirty page tracking like vIOMMU
> offloading where it almost impossible to be done in virtio. Virtio
> needs to leverage the platform or transport instead of reinventing
> wheels so it can focus on the virtio device logic.
>
>> It does so for shadow
>> virtqueue from where it gets started to where it is now, even so there's
>> still a lot of optimization work not done yet. There must be head room
>> here for device page tracking or platform tracking, too.
> Let's then focus on the possible issues (I've pointed out a brunches).
>
> Thanks
>
>> Regards,
>> -Siwei
>>
>>
>>> Shadow virtqueue can be used with a save/load model for device state
>>> recovery for sure.
>>>
>>>>> But it seems the shadow virtqueue itself is not the major factor but
>>>>> the time spent on programming vendor specific mappings for example.
>>>> Yep. The slowness on mapping part is mostly due to the artifact of
>>>> software-based implementation. IMHO for live migration p.o.v it's better
>>>> to not involve any mapping operation in the down time path at all.
>>> Yes.
>>>
>>> Thanks
>>>
>>>> -Siwei
>>>>> Thanks
>>>>>
>>>>>> The time didn't come from pci side or boot side.
>>>>>>
>>>>>> For pci side of things you would want to compare the pci vs non pci device based VM boot time.
>>>>>>




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
  2023-11-28  3:00                                                                       ` Si-Wei Liu
@ 2023-11-29  5:12                                                                         ` Jason Wang
  0 siblings, 0 replies; 157+ messages in thread
From: Jason Wang @ 2023-11-29  5:12 UTC (permalink / raw)
  To: Si-Wei Liu
  Cc: Parav Pandit, Michael S. Tsirkin, Zhu, Lingshan, virtio-comment,
	cohuck, sburla, Shahaf Shuler, Maor Gottlieb, Yishai Hadas,
	eperezma

On Tue, Nov 28, 2023 at 11:00 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 11/23/2023 6:29 PM, Jason Wang wrote:
> > On Thu, Nov 23, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 11/21/2023 9:31 PM, Jason Wang wrote:
> >>> On Wed, Nov 22, 2023 at 10:31 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>> (dropping my personal email abandoned for upstream discussion for now,
> >>>> please try to copy my corporate email address for more timely response)
> >>>>
> >>>> On 11/20/2023 10:55 PM, Jason Wang wrote:
> >>>>> On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
> >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>> Sent: Friday, November 17, 2023 7:31 PM
> >>>>>>> To: Parav Pandit <parav@nvidia.com>
> >>>>>>>
> >>>>>>> On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> >>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>> Sent: Friday, November 17, 2023 6:02 PM
> >>>>>>>>>
> >>>>>>>>> On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> >>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>> Sent: Friday, November 17, 2023 5:35 PM
> >>>>>>>>>>> To: Parav Pandit <parav@nvidia.com>
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> >>>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>> Sent: Friday, November 17, 2023 5:04 PM
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> >>>>>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>>>>>>>> Sent: Friday, November 17, 2023 4:30 PM
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> >>>>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>>>>> Sent: Friday, November 17, 2023 3:30 PM
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu,
> >>>>>>>>>>>>>>>>>> Lingshan
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>>>>>>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav
> >>>>>>>>>>>>>>>>>>>> Pandit
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>> We should expose a limit of the device in the
> >>>>>>>>>>>>>>>>>>>>> proposed
> >>>>>>>>>>>>>>>>> WRITE_RECORD_CAP_QUERY command, that how much
> >>>>>>> range
> >>>>>>>>>>>>>>>>> it can
> >>>>>>>>>>>>> track.
> >>>>>>>>>>>>>>>>>>>>> So that future provisioning framework can use it.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I will cover this in v5 early next week.
> >>>>>>>>>>>>>>>>>>>> I do worry about how this can even work though.
> >>>>>>>>>>>>>>>>>>>> If you want a generic device you do not get to
> >>>>>>>>>>>>>>>>>>>> dictate how much memory VM
> >>>>>>>>>>>>> has.
> >>>>>>>>>>>>>>>>>>>> Aren't we talking bit per page? With 1TByte of
> >>>>>>>>>>>>>>>>>>>> memory to track
> >>>>>>>>>>>>>>>>>>>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> And you happily say "we'll address this in the future"
> >>>>>>>>>>>>>>>>>>>> while at the same time fighting tooth and nail
> >>>>>>>>>>>>>>>>>>>> against adding single bit status registers because
> >>>>>>> scalability?
> >>>>>>>>>>>>>>>>>>>> I have a feeling doing this completely
> >>>>>>>>>>>>>>>>>>>> theoretical like this is
> >>>>>>>>>>>>> problematic.
> >>>>>>>>>>>>>>>>>>>> Maybe you have it all laid out neatly in your
> >>>>>>>>>>>>>>>>>>>> head but I suspect not all of TC can picture it
> >>>>>>>>>>>>>>>>>>>> clearly enough based just on spec
> >>>>>>>>>>>>> text.
> >>>>>>>>>>>>>>>>>>>> We do sometimes ask for POC implementation in
> >>>>>>>>>>>>>>>>>>>> linux / qemu to demonstrate how things work
> >>>>>>>>>>>>>>>>>>>> before merging
> >>>>>>>>> code.
> >>>>>>>>>>>>>>>>>>>> We skipped this for admin things so far but I
> >>>>>>>>>>>>>>>>>>>> think it's a good idea to start doing it here.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> What makes me pause a bit before saying please
> >>>>>>>>>>>>>>>>>>>> do a PoC is all the opposition that seems to
> >>>>>>>>>>>>>>>>>>>> exist to even using admin commands in the 1st
> >>>>>>>>>>>>>>>>>>>> place. I think once we finally stop arguing
> >>>>>>>>>>>>>>>>>>>> about whether to use admin commands at all then
> >>>>>>>>>>>>>>>>>>>> a PoC will be needed
> >>>>>>>>>>>>>>> before merging.
> >>>>>>>>>>>>>>>>>>> We have POR productions that implemented the
> >>>>>>>>>>>>>>>>>>> approach in my
> >>>>>>>>>>>>> series.
> >>>>>>>>>>>>>>>>>>> They are multiple generations of productions in
> >>>>>>>>>>>>>>>>>>> market and running in customers data centers for years.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Back to 2019 when we start working on vDPA, we
> >>>>>>>>>>>>>>>>>>> have sent some samples of production(e.g.,
> >>>>>>>>>>>>>>>>>>> Cascade
> >>>>>>>>>>>>>>>>>>> Glacier) and the datasheet, you can find live
> >>>>>>>>>>>>>>>>>>> migration facilities there, includes suspend, vq
> >>>>>>>>>>>>>>>>>>> state and other
> >>>>>>>>> features.
> >>>>>>>>>>>>>>>>>>> And there is an reference in DPDK live migration,
> >>>>>>>>>>>>>>>>>>> I have provided this page
> >>>>>>>>>>>>>>>>>>> before:
> >>>>>>>>>>>>>>>>>>> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.ht
> >>>>>>>>>>>>>>>>>>> ml, it has been working for long long time.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> So if we let the facts speak, if we want to see
> >>>>>>>>>>>>>>>>>>> if the proposal is proven to work, I would
> >>>>>>>>>>>>>>>>>>> say: They are POR for years, customers already
> >>>>>>>>>>>>>>>>>>> deployed them for
> >>>>>>>>>>>>> years.
> >>>>>>>>>>>>>>>>>> And I guess what you are trying to say is that
> >>>>>>>>>>>>>>>>>> this patchset we are reviewing here should be help
> >>>>>>>>>>>>>>>>>> to the same standard and there should be a PoC? Sounds
> >>>>>>> reasonable.
> >>>>>>>>>>>>>>>>> Yes and the in-marketing productions are POR, the
> >>>>>>>>>>>>>>>>> series just improves the design, for example, our
> >>>>>>>>>>>>>>>>> series also use registers to track vq state, but
> >>>>>>>>>>>>>>>>> improvements than CG or BSC. So I think they are
> >>>>>>>>>>>>>>>>> proven
> >>>>>>>>>>>>>>> to work.
> >>>>>>>>>>>>>>>> If you prefer to go the route of POR and production
> >>>>>>>>>>>>>>>> and proven documents
> >>>>>>>>>>>>>>> etc, there is ton of it of multiple types of products I
> >>>>>>>>>>>>>>> can dump here with open- source code and documentation and
> >>>>>>> more.
> >>>>>>>>>>>>>>>> Let me know what you would like to see.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Michael has requested some performance comparisons,
> >>>>>>>>>>>>>>>> not all are ready to
> >>>>>>>>>>>>>>> share yet.
> >>>>>>>>>>>>>>>> Some are present that I will share in coming weeks.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> And all the vdpa dpdk you published does not have
> >>>>>>>>>>>>>>>> basic CVQ support when I
> >>>>>>>>>>>>>>> last looked at it.
> >>>>>>>>>>>>>>>> Do you know when was it added?
> >>>>>>>>>>>>>>> It's good enough for PoC I think, CVQ or not.
> >>>>>>>>>>>>>>> The problem with CVQ generally, is that VDPA wants to
> >>>>>>>>>>>>>>> shadow CVQ it at all times because it wants to decode
> >>>>>>>>>>>>>>> and cache the content. But this problem has nothing to
> >>>>>>>>>>>>>>> do with dirty tracking even though it also
> >>>>>>>>>>>>> mentions "shadow":
> >>>>>>>>>>>>>>> if device can report it's state then there's no need to shadow
> >>>>>>> CVQ.
> >>>>>>>>>>>>>> For the performance numbers with the pre-copy and device
> >>>>>>>>>>>>>> context of
> >>>>>>>>>>>>> patches posted 1 to 5, the downtime reduction of the VM is
> >>>>>>>>>>>>> 3.71x with active traffic on 8 RQs at 100Gbps port speed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sounds good can you please post a bit more detail?
> >>>>>>>>>>>>> which configs are you comparing what was the result on each of
> >>>>>>> them.
> >>>>>>>>>>>> Common config: 8+8 tx and rx queues.
> >>>>>>>>>>>> Port speed: 100Gbps
> >>>>>>>>>>>> QEMU 8.1
> >>>>>>>>>>>> Libvirt 7.0
> >>>>>>>>>>>> GVM: Centos 7.4
> >>>>>>>>>>>> Device: virtio VF hardware device
> >>>>>>>>>>>>
> >>>>>>>>>>>> Config_1: virtio suspend/resume similar to what Lingshan has,
> >>>>>>>>>>>> largely vdpa stack
> >>>>>>>>>>>> Config_2: Device context method of admin commands
> >>>>>>>>>>> OK that sounds good. The weird thing here is that you measure
> >>>>>>> "downtime".
> >>>>>>>>>>> What exactly do you mean here?
> >>>>>>>>>>> I am guessing it's the time to retrieve on source and re-program
> >>>>>>>>>>> device state on destination? And this is 3.71x out of how long?
> >>>>>>>>>> Yes. Downtime is the time during which the VM is not responding or
> >>>>>>>>>> receiving
> >>>>>>>>> packets, which involves reprogramming the device.
> >>>>>>>>>> 3.71x is relative time for this discussion.
> >>>>>>>>> Oh interesting. So VM state movement including reprogramming the CPU
> >>>>>>>>> is dominated by reprogramming this single NIC, by a factor of almost 4?
> >>>>>>>> Yes.
> >>>>>>> Could you post some numbers too then?  I want to know whether that would
> >>>>>>> imply that VM boot is slowed down significantly too. If yes that's another
> >>>>>>> motivation for pci transport 2.0.
> >>>>>> It was 1.8 sec down to 480msec.
> >>>>> Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
> >>>>>
> >>>>> Eugenio or Si-wei may share an exact number, but it should be several
> >>>>> hundreds of ms.
> >>>> That was mostly for device teardown time at the the source but there's
> >>>> also setup cost at the destination that needs to be counted.
> >>>> Several hundred of milliseconds would be the ultimate goal I would say
> >>>> (right now the numbers from Parav more or less reflects the status quo
> >>>> but there's ongoing work to make it further down), and I don't doubt
> >>>> several hundreds of ms is possible. But to be fair, on the other hand,
> >>>> shadow vq on real vdpa hardware device would need a lot of dedicated
> >>>> optimization work across all layers (including hardware or firmware) all
> >>>> over the places to achieve what a simple suspend-resume (save/load)
> >>>> interface can easily do with VFIO migration.
> >>> That's fine. Just to clairfy, shadow virtqueue here doesn't mean it
> >>> can't save/load. We want to see how it is useful for dirty page
> >>> tracking since tracking dirty pages by device itself seems problematic
> >>> at least from my point of view.
> >> TBH I don't see how this comparison can help prove the problematic part
> >> of device dirty tracking, or if it has anything to do with.
> > Shadow virtqueue is not used to prove the problem, the problem could
> > be uncovered during the review.
> >
> > The shadow virtuqueue is used to give us a bottom line. If a huge
> > effort were done for spec but it can't perform better than virtqueue,
> > the effort became meaningless.
> Got it. Thanks for detailed clarifications, Jason. So it's not device
> assisted dirty tracking itself you find issue with, but just the
> flaw/inefficiency in the current proposal as pointed out in previous
> discussions?  In other word, does it make sense to you if certain device
> assisted tracking scheme is proved to be helpful or perform better than
> the others, and its backed by real performance data, be it shadow vq or
> platform IOMMU tracking in just a few scenarios or in the mostly common
> used set up, then is it acceptable to you even if the same device
> tracking mechanism doesn't support or doesn't have reasonably good value
> for other scenarios (for e.g. PASID, ATS, vIOMMU and etc as you listed
> below)?

For PASID, maybe we can leave it aside.
For ATS, I think we need to at least clarify the behaviour.
For vIOMMU, I'd suggest testing the RTT of a domain-selective
invalidation via the vIOMMU, to make sure there are at least no lockups
in the guest.

And I think we need an API to throttle the dirty rate (no matter what
kind of dirty page tracking is used) so the hypervisor can use
heuristics to make the migration converge.
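
For illustration only, such a throttle could be a per-member admin command
carrying a dirty-rate budget. The opcode and field names below are made up
for this discussion and are not part of any posted proposal:

#include <stdint.h>

/* Hypothetical opcode and layout, for discussion only. */
#define VIRTIO_ADMIN_CMD_DIRTY_RATE_SET	0x80

struct virtio_admin_cmd_dirty_rate_set {
	uint64_t member_id;		/* group member (VF) being migrated */
	uint64_t max_dirty_bytes;	/* bytes the device may write to guest
					 * memory per interval_ms window */
	uint64_t interval_ms;		/* length of the accounting window */
};

The hypervisor would lower max_dirty_bytes between pre-copy iterations
until the estimated downtime drops below its limit, independent of whether
dirty pages are tracked by the device, the platform IOMMU or a shadow
virtqueue.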

But we can hear from others for sure.

>
> It's up to the author to further improve on the current spec proposal,
> but if the device assisted tracking itself in general is problematic and
> prohibited even if proved to be best performing for some (but not ALL)
> use cases, I will be very surprised to know the reason why, as it is
> just an optional device feature aiming to be self-contained in virtio
> itself, without having to depending on vendor specific optimization
> (like vdpa shadow vq).

There's no reason to prohibit it if it works well :). Parav said it is
optional in the new version.

My question is not about whether it's good or not to do that in
virtio. It's about whether it can behave correctly when we try to do
that in virtio:

Besides the above issues, here are more: 1) DMA to the MSI-X table,
2) whether to log P2P writes, 3) whether to log DMA to RMRRs, etc.
None of these seems easy to do in a virtio device, or at least I would
like to know how they are handled.

Thanks

>
> Thanks
> -Siwei
> >
> >> In many
> >> cases vDPA and hardware virtio are for different deployment scenarios
> >> with varied target users, I don't see how vDPA can completely substitute
> >> hardware virtio for many reasons regardless shadow virtqueue wins or not.
> > It's not about whether vDPA can win or not. It's about a quick
> > demonstration about how shadow virtqueue can perform.  From the view
> > of the shadow virtqueue, it doesn't know whether the underlayer is
> > vDPA or virtio. It's not hard to imagine, the downtime we get from
> > vDPA is the bottom line of downtime via virtio since virtio is much
> > more easier.
> >
> >> If anything relevant I would more like to see performance comparison
> >> with platform dirty tracking via IOMMUFD, but that's perhaps too early
> >> stage at this point to conclude anything given there's very limited
> >> availability (in terms of supporting software, I know some supporting
> >> hardware has been around for a few years) and none of the potential
> >> software optimizations is in place at this point to make a fair
> >> comparison for.
> > We need to make sure the correctness of the function before we can
> > talk about optimizations. And I don't see how this proposal is
> > optimized for many ways.
> >
> >> Granted device assisted tracking has its own set of
> >> limitations e.g. loose coupling or integration with platform features,
> >> lack of nested and PASID support et al. However, state of the art for
> >> platform dirty tracking is not perfect either, far off being highly
> >> optimized for all types of workload or scenarios. At least to me the
> >> cost of page table walk to scan all PTEs across all levels is not easily
> >> negligible - given no PML equivalent here, are we sure the whole range
> >> scan can be as efficient and scalable as memory size / # of PTEs grows?
> > If you see the discussion, this proposal requires scan PTEs as well in
> > many ways.
> >
> >> How large it may impact the downtime with this rudimentary dirty scan?
> >> No data point was given thus far. If chances are that there could be
> >> major improvement from device tracking for those general use cases to
> >> supplement what platform cannot achieve efficiently enough, it's not too
> >> good to kill off the possibility entirely at this early stage. Maybe a
> >> PoC or some comparative performance data can help prove the theory?
> > We can ask in the thread of IOMMUFD dirty tracking patches.
> >
> >> On the other hand, the device assisted tracking has at least one
> >> advantage that platform cannot simply offer - throttle down device for
> >> convergence, inherently or explicitly whenever needed.
> > Please refer the past discussion, I can see how it is throttled in the
> > case of PML similar mechanism. But I can't see how it can be done
> > here. This proposal requires the device to reserver sufficient
> > resources where the throttle is implementation specific where the
> > hypervisor can't depend on. It needs API to set dirty page rates at
> > least.
> >
> >> I think earlier
> >> Micheal suggested something to make the core data structure used for
> >> logging more efficient and compact, working like PML but using a queue
> >> or an array, and the entry of which may contain a list of discrete pages
> >> or contiguous PFN ranges.
> > PML solve the resources problem but not other problem:
> >
> > 1) Throttling: it's still not something that hypervisor can depend.
> > The reason why PML in CPU work is that hypervisor can throttle the KVM
> > process so it can slow down to the expected dirty rates.
> > 2) Platform specific issue: PASID, ATS, translation failures, reserved
> > regions, and a lot of other stuffs
> > 3) vIOMMU issue: horrible delay in IOTLB invalidation path
> > 4) Doesn't work in the case of vIOMMU offloading
> >
> > And compare the the existing approach, it ends up with more PCI
> > transactions under heavy load.
> >
> >> On top of this one may add parallelism to
> >> distribute load to multiple queues, or add zero copy to speed up dirty
> >> sync to userspace - things virtio queues are pretty good at doing. After
> >> all, nothing can be perfect to begin with, and every complex feature
> >> would need substantial time to improve and evolve.
> > Evolve is good, but the problem is platform is also evolving. The
> > function is duplicated there and platform provides a lot of advanced
> > features that can co-operate with dirty page tracking like vIOMMU
> > offloading where it almost impossible to be done in virtio. Virtio
> > needs to leverage the platform or transport instead of reinventing
> > wheels so it can focus on the virtio device logic.
> >
> >> It does so for shadow
> >> virtqueue from where it gets started to where it is now, even so there's
> >> still a lot of optimization work not done yet. There must be head room
> >> here for device page tracking or platform tracking, too.
> > Let's then focus on the possible issues (I've pointed out a brunches).
> >
> > Thanks
> >
> >> Regards,
> >> -Siwei
> >>
> >>
> >>> Shadow virtqueue can be used with a save/load model for device state
> >>> recovery for sure.
> >>>
> >>>>> But it seems the shadow virtqueue itself is not the major factor but
> >>>>> the time spent on programming vendor specific mappings for example.
> >>>> Yep. The slowness on mapping part is mostly due to the artifact of
> >>>> software-based implementation. IMHO for live migration p.o.v it's better
> >>>> to not involve any mapping operation in the down time path at all.
> >>> Yes.
> >>>
> >>> Thanks
> >>>
> >>>> -Siwei
> >>>>> Thanks
> >>>>>
> >>>>>> The time didn't come from pci side or boot side.
> >>>>>>
> >>>>>> For pci side of things you would want to compare the pci vs non pci device based VM boot time.
> >>>>>>




^ permalink raw reply	[flat|nested] 157+ messages in thread

end of thread, other threads:[~2023-11-29  5:12 UTC | newest]

Thread overview: 157+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-30 13:19 [virtio-comment] [PATCH v3 0/8] Introduce device migration support commands Parav Pandit
2023-10-30 13:19 ` [virtio-comment] [PATCH v3 1/8] admin: Add theory of operation for device migration Parav Pandit
2023-10-30 13:19 ` [virtio-comment] [PATCH v3 2/8] admin: Redefine reserved2 as command specific output Parav Pandit
2023-10-30 13:19 ` [virtio-comment] [PATCH v3 3/8] device-context: Define the device context fields for device migration Parav Pandit
2023-10-30 13:19 ` [virtio-comment] [PATCH v3 4/8] admin: Add device migration admin commands Parav Pandit
2023-10-30 13:19 ` [virtio-comment] [PATCH v3 5/8] admin: Add requirements of device migration commands Parav Pandit
2023-10-30 13:19 ` [virtio-comment] [PATCH v3 6/8] admin: Add theory of operation for write recording commands Parav Pandit
2023-10-31  1:43   ` [virtio-comment] " Jason Wang
2023-10-31  3:27     ` [virtio-comment] " Parav Pandit
2023-10-31  7:45       ` [virtio-comment] " Michael S. Tsirkin
2023-10-31  9:32         ` Zhu, Lingshan
2023-10-31  9:41           ` Michael S. Tsirkin
2023-10-31  9:47             ` Zhu, Lingshan
2023-11-01  0:29       ` Jason Wang
2023-11-01  3:02         ` [virtio-comment] " Parav Pandit
2023-11-02  4:24           ` [virtio-comment] " Jason Wang
2023-11-02  6:10             ` [virtio-comment] " Parav Pandit
2023-11-06  6:34               ` [virtio-comment] " Jason Wang
2023-11-06  6:53                 ` [virtio-comment] " Parav Pandit
2023-11-07  4:04                   ` [virtio-comment] " Jason Wang
2023-11-07  7:05                     ` Michael S. Tsirkin
2023-11-08  4:28                       ` Jason Wang
2023-11-08  8:17                         ` Michael S. Tsirkin
2023-11-08  9:00                           ` [virtio-comment] " Parav Pandit
2023-11-08 17:16                             ` [virtio-comment] " Michael S. Tsirkin
2023-11-09  6:27                               ` Parav Pandit
2023-11-09  3:31                           ` Jason Wang
2023-11-09  7:59                             ` Michael S. Tsirkin
2023-11-10  6:46                               ` [virtio-comment] " Parav Pandit
2023-11-13  3:41                                 ` [virtio-comment] " Jason Wang
2023-11-13 14:30                                   ` Michael S. Tsirkin
2023-11-14  2:03                                     ` Zhu, Lingshan
2023-11-14  7:52                                       ` Jason Wang
2023-11-15 17:37                                   ` [virtio-comment] " Parav Pandit
2023-11-16  4:24                                     ` [virtio-comment] " Jason Wang
2023-11-16  6:49                                       ` Michael S. Tsirkin
2023-11-21  4:21                                         ` Jason Wang
2023-11-21 16:24                                           ` [virtio-comment] " Parav Pandit
2023-11-22  4:11                                             ` [virtio-comment] " Jason Wang
2023-11-16  6:50                                     ` Michael S. Tsirkin
2023-11-13  3:31                               ` Jason Wang
2023-11-13  6:57                                 ` Michael S. Tsirkin
2023-11-14  7:34                                   ` Zhu, Lingshan
2023-11-14  7:59                                     ` Jason Wang
2023-11-14  8:27                                     ` Michael S. Tsirkin
2023-11-15  4:05                                       ` Zhu, Lingshan
2023-11-15  7:51                                         ` Michael S. Tsirkin
2023-11-15  7:59                                           ` Zhu, Lingshan
2023-11-15  8:05                                             ` Michael S. Tsirkin
2023-11-15  8:42                                               ` Zhu, Lingshan
2023-11-15 11:52                                                 ` Michael S. Tsirkin
2023-11-16  9:38                                                   ` Zhu, Lingshan
2023-11-16 12:18                                                     ` Michael S. Tsirkin
2023-11-17  9:50                                                       ` Zhu, Lingshan
2023-11-17  9:55                                                         ` Michael S. Tsirkin
2023-11-14  7:57                                   ` Jason Wang
2023-11-14  9:16                                     ` Michael S. Tsirkin
2023-11-15 17:42                                 ` [virtio-comment] " Parav Pandit
2023-11-16  4:18                                   ` [virtio-comment] " Jason Wang
2023-11-16  5:27                                     ` [virtio-comment] " Parav Pandit
2023-11-17 10:15                                   ` [virtio-comment] " Michael S. Tsirkin
2023-11-17 10:48                                     ` Parav Pandit
2023-11-17 11:19                                       ` Michael S. Tsirkin
2023-11-17 11:32                                         ` Parav Pandit
2023-11-17 11:49                                           ` Michael S. Tsirkin
2023-11-17 12:15                                             ` Parav Pandit
2023-11-17 12:37                                               ` Michael S. Tsirkin
2023-11-17 12:49                                                 ` Parav Pandit
2023-11-17 13:58                                                   ` Michael S. Tsirkin
2023-11-17 14:49                                                     ` Parav Pandit
2023-11-17 15:00                                                       ` Michael S. Tsirkin
2023-11-09  6:26                         ` [virtio-comment] " Parav Pandit
2023-11-15  7:59                           ` [virtio-comment] " Michael S. Tsirkin
2023-11-15 17:42                             ` [virtio-comment] " Parav Pandit
2023-11-09  6:24                     ` Parav Pandit
2023-11-13  3:37                       ` [virtio-comment] " Jason Wang
2023-11-15 17:38                         ` [virtio-comment] " Parav Pandit
2023-11-16  4:23                           ` [virtio-comment] " Jason Wang
2023-11-16  5:29                             ` [virtio-comment] " Parav Pandit
2023-11-16  5:51                               ` [virtio-comment] " Michael S. Tsirkin
2023-11-16  7:35                                 ` Michael S. Tsirkin
2023-11-16  7:40                                   ` [virtio-comment] " Parav Pandit
2023-11-16 11:48                                     ` [virtio-comment] " Michael S. Tsirkin
2023-11-16 16:26                                       ` [virtio-comment] " Parav Pandit
2023-11-16 17:25                                         ` [virtio-comment] " Michael S. Tsirkin
2023-11-16 17:29                                           ` [virtio-comment] " Parav Pandit
2023-11-16 18:20                                             ` [virtio-comment] " Michael S. Tsirkin
2023-11-17  3:02                                               ` [virtio-comment] " Parav Pandit
2023-11-17  8:46                                                 ` [virtio-comment] " Michael S. Tsirkin
2023-11-17  9:14                                                   ` [virtio-comment] " Parav Pandit
2023-11-17  9:37                                                     ` [virtio-comment] " Michael S. Tsirkin
2023-11-17  9:41                                                       ` [virtio-comment] " Parav Pandit
2023-11-17  9:44                                                         ` Parav Pandit
2023-11-17  9:51                                                         ` [virtio-comment] " Michael S. Tsirkin
2023-11-17  9:54                                                           ` Zhu, Lingshan
2023-11-17 10:02                                                             ` Michael S. Tsirkin
2023-11-17 10:10                                                               ` Parav Pandit
2023-11-17  9:57                                                           ` Parav Pandit
2023-11-17 10:37                                                             ` Michael S. Tsirkin
2023-11-17 10:52                                                               ` Parav Pandit
2023-11-17 11:32                                                                 ` Michael S. Tsirkin
2023-11-17 12:22                                                                   ` Parav Pandit
2023-11-17 12:40                                                                     ` Michael S. Tsirkin
2023-11-17 12:51                                                                       ` Parav Pandit
2023-11-21  5:16                                                                         ` Jason Wang
2023-11-21 16:29                                                                           ` Parav Pandit
2023-11-21 21:00                                                                             ` Michael S. Tsirkin
2023-11-22  3:46                                                                               ` Parav Pandit
2023-11-22  7:44                                                                                 ` Michael S. Tsirkin
2023-11-22  4:17                                                                             ` Jason Wang
2023-11-22  4:34                                                                               ` Parav Pandit
2023-11-24  3:15                                                                                 ` Jason Wang
2023-11-17  9:52                                                         ` Zhu, Lingshan
2023-11-17  9:59                                                           ` [virtio-comment] " Parav Pandit
2023-11-17 10:00                                                             ` [virtio-comment] " Zhu, Lingshan
2023-11-21  4:24                                                 ` Jason Wang
2023-11-21 16:26                                                   ` [virtio-comment] " Parav Pandit
2023-11-22  4:14                                                     ` [virtio-comment] " Jason Wang
2023-11-22  4:19                                                       ` [virtio-comment] " Parav Pandit
2023-11-24  3:09                                                         ` [virtio-comment] " Jason Wang
2023-11-16 10:28                                 ` Zhu, Lingshan
2023-11-16 11:59                                   ` Michael S. Tsirkin
2023-11-17  9:59                                     ` Zhu, Lingshan
2023-11-17 10:03                                       ` Parav Pandit
2023-11-17 11:00                                         ` Michael S. Tsirkin
2023-11-17 11:05                                           ` Parav Pandit
2023-11-17 11:33                                             ` Michael S. Tsirkin
2023-11-17 11:45                                               ` Parav Pandit
2023-11-17 12:04                                                 ` Michael S. Tsirkin
2023-11-17 12:11                                                   ` Parav Pandit
2023-11-17 12:32                                                     ` Michael S. Tsirkin
2023-11-17 13:03                                                       ` Parav Pandit
2023-11-17 14:00                                                         ` Michael S. Tsirkin
2023-11-17 14:48                                                           ` Parav Pandit
2023-11-17 14:59                                                             ` Michael S. Tsirkin
2023-11-21  6:55                                                             ` Jason Wang
2023-11-21 16:30                                                               ` Parav Pandit
2023-11-22  4:19                                                                 ` Jason Wang
2023-11-22  4:28                                                                   ` Parav Pandit
2023-11-24  3:08                                                                     ` Jason Wang
2023-11-22  2:31                                                               ` Si-Wei Liu
2023-11-22  5:31                                                                 ` Jason Wang
2023-11-23 13:19                                                                   ` Si-Wei Liu
2023-11-23 14:39                                                                     ` Michael S. Tsirkin
2023-11-24  2:29                                                                     ` Jason Wang
2023-11-28  3:00                                                                       ` Si-Wei Liu
2023-11-29  5:12                                                                         ` Jason Wang
2023-11-17 10:40                                       ` Michael S. Tsirkin
2023-11-21  4:23                                 ` Jason Wang
2023-11-21  7:14                               ` Jason Wang
2023-11-21 16:31                                 ` [virtio-comment] " Parav Pandit
2023-11-22  4:28                                   ` [virtio-comment] " Jason Wang
2023-11-22  6:41                                     ` [virtio-comment] " Parav Pandit
2023-11-24  3:06                                       ` [virtio-comment] " Jason Wang
2023-11-15  7:58                       ` Michael S. Tsirkin
2023-10-30 13:19 ` [virtio-comment] [PATCH v3 7/8] admin: Add " Parav Pandit
2023-10-30 13:19 ` [virtio-comment] [PATCH v3 8/8] admin: Add requirements of write reporting commands Parav Pandit
