* [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands
@ 2023-09-09 14:29 Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 1/8] admin: Add theory of operation for device migration Parav Pandit
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
This series introduces administration commands for member device migration
for PCI transport; when needed it can be extended for other transports
too.
Use case requirements:
======================
1. A hypervisor system needs to provide a PCI VF as a passthrough
device to the guest virtual machine and also support live
migration of this virtual machine.
2. A virtual machine may have one or more such passthrough
virtio devices.
3. A virtual machine may have other PCI passthrough devices
which may also interact with the virtio device.
4. A hypervisor runs a vendor agnostic driver with an extension
to support device migration.
5. A PCI VF passthrough device needs to support transparent
device reset and PCI FLR while the device migration is
ongoing.
6. An owner driver is not involved in mediating device operations
for the passthrough device at the virtio interface level.
7. The mechanism is generic enough to apply to a large family of
virtio devices and does not involve trapping any virtio
device interfaces for the passthrough devices.
Overview:
=========
The above use case requirements are addressed by the PCI PF group owner
driver, which facilitates the member device migration functionality using
administration commands.
There are three major functionalities.
1. Suspend and resume the device operation
2. Read and write the device context containing all the information
that can be transferred from source to destination to migrate
a member device
3. Track pages written by the device while device migration is
ongoing
This series introduces 4 infrastructure pieces covering PCI transport,
peer to peer PCI devices, page tracking (aka dirty page tracking) and
generic device context.
1. Device mode get and set (active, stop, freeze)
2. Device context read and write
3. Device context definition
4. Write reporting to track page addresses
This series enables virtio PCI SR-IOV member device to member device
migration. If and when needed, it can also be used to migrate between a
PCI SR-IOV member device and a software composed PCI virtio device,
given software that can parse and compose the device context.
Example flow:
=============
Source hypervisor:
1. Instructs the device to start tracking the pages it writes
2. Periodically queries the addresses of the written pages
3. Suspends the device operation
4. Reads the device context and transfers it to the destination hypervisor
Destination hypervisor:
5. Writes the device context received from the source
6. Resumes the device with the newly written device context
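As a rough illustration only, steps 3 to 6 of the flow above can be sketched
in C against a mock member device. Everything named here (struct mock_dev,
src_save, dst_restore, the fixed 64 byte context buffer) is hypothetical and
merely stands in for the administration commands proposed in this series;
write tracking (steps 1 and 2) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mode values follow the table proposed in patch 1. */
enum dev_mode { DEV_MODE_ACTIVE = 0x0, DEV_MODE_STOP = 0x1, DEV_MODE_FREEZE = 0x2 };

/* Hypothetical in-memory stand-in for a member device. */
struct mock_dev {
    enum dev_mode mode;
    uint8_t ctx[64];   /* opaque device context bytes */
    size_t ctx_len;
};

/* Source side, steps 3-4: suspend the device, then read its context.
 * Returns the number of bytes copied into buf, or 0 if buf is too small. */
size_t src_save(struct mock_dev *src, uint8_t *buf, size_t buf_len)
{
    assert(src != NULL && buf != NULL);
    src->mode = DEV_MODE_STOP;    /* no notifications, no driver memory access */
    src->mode = DEV_MODE_FREEZE;  /* no further device context changes */
    if (src->ctx_len > buf_len)
        return 0;
    memcpy(buf, src->ctx, src->ctx_len);
    return src->ctx_len;
}

/* Destination side, steps 5-6: write the context while frozen, then resume.
 * Returns 0 on success, -1 if the context does not fit. */
int dst_restore(struct mock_dev *dst, const uint8_t *buf, size_t len)
{
    assert(dst != NULL && buf != NULL);
    if (len > sizeof(dst->ctx))
        return -1;
    dst->mode = DEV_MODE_FREEZE;  /* context is written in Freeze mode */
    memcpy(dst->ctx, buf, len);
    dst->ctx_len = len;
    dst->mode = DEV_MODE_ACTIVE;  /* resume with the newly written context */
    return 0;
}
```

How the saved bytes travel between the two hypervisors is deliberately not
modelled; the series leaves that transfer method out of scope.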
Patch summary:
==============
patch-1: Adds theory of operation for device migration commands
patch-2: Redefines reserved2 as a command specific output field
patch-3: Defines short device context for split virtqueues
patch-4: Adds device migration commands
patch-5: Adds requirements for device migration commands
patch-6: Adds theory of operation for write reporting commands
patch-7: Adds write reporting commands
patch-8: Adds requirements for write reporting commands
In the next version (v1), a more detailed device context will be defined
along with its requirements.
It also takes inspiration from the similar idea presented at KVM Forum
at [1].
Please review.
[1] https://static.sched.com/hosted_files/kvmforum2022/3a/KVM22-Migratable-Vhost-vDPA.pdf
Parav Pandit (8):
admin: Add theory of operation for device migration
admin: Redefine reserved2 as command specific output
device-context: Define the device context fields for device migration
admin: Add device migration admin commands
admin: Add requirements of device migration commands
admin: Add theory of operation for write recording commands
admin: Add write recording commands
admin: Add requirements of write reporting commands
admin-cmds-device-migration.tex | 574 ++++++++++++++++++++++++++++++++
admin.tex | 38 ++-
content.tex | 1 +
device-context.tex | 107 ++++++
4 files changed, 713 insertions(+), 7 deletions(-)
create mode 100644 admin-cmds-device-migration.tex
create mode 100644 device-context.tex
--
2.34.1
This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.
In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.
Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [virtio-comment] [PATCH 1/8] admin: Add theory of operation for device migration
2023-09-09 14:29 [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands Parav Pandit
@ 2023-09-09 14:29 ` Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 2/8] admin: Redefine reserved2 as command specific output Parav Pandit
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
Passthrough PCI VF devices are ubiquitous in virtual machines, typically
exposed through a generic kernel framework such as vfio [1].
A passthrough PCI VF device is fully owned by the virtual machine
device driver. Such a passthrough device controls its own device
reset flow, basic functionality such as PCI VF function level reset,
and the rest of the virtio device functionality such as control vq,
config space access and data path descriptor handling.
Additionally, VM live migration using a precopy method is also widely used.
To support VM live migration for such passthrough virtio devices,
the owner PCI PF device administers the device migration flow.
This patch introduces the basic theory of operation which describes the flow
and supporting administration commands.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/vfio.h?h=v6.1.47
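For readers who prefer code, the three modes introduced by the diff below
(Active 0x0, Stop 0x1, Freeze 0x2) and the behaviour the table attaches to
them can be summarised in a small C sketch. The enum and function names are
hypothetical; the patch itself defines only the numeric values and the prose
descriptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mode values as proposed in the table added by this patch. */
enum mig_mode {
    MIG_MODE_ACTIVE = 0x0,
    MIG_MODE_STOP   = 0x1,
    MIG_MODE_FREEZE = 0x2,
};
static_assert(MIG_MODE_FREEZE < 0x03, "0x03-0xFF is reserved for future use");

/* True when the value is one of the three defined modes. */
bool mig_mode_is_defined(uint8_t mode)
{
    return mode <= MIG_MODE_FREEZE;
}

/* The device sends notifications and accesses driver memory only in Active. */
bool mig_dev_may_initiate(uint8_t mode)
{
    return mode == MIG_MODE_ACTIVE;
}

/* Driver notifications can still arrive in Stop; Freeze does not accept them. */
bool mig_dev_accepts_driver_notify(uint8_t mode)
{
    return mode == MIG_MODE_ACTIVE || mode == MIG_MODE_STOP;
}

/* The device context may change in Active and Stop; it is stable in Freeze. */
bool mig_ctx_may_change(uint8_t mode)
{
    return mode == MIG_MODE_ACTIVE || mode == MIG_MODE_STOP;
}
```

The last two predicates happen to coincide today; they are kept separate
because the table describes them as distinct properties.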
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
admin-cmds-device-migration.tex | 94 +++++++++++++++++++++++++++++++++
admin.tex | 1 +
2 files changed, 95 insertions(+)
create mode 100644 admin-cmds-device-migration.tex
diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
new file mode 100644
index 0000000..f839af4
--- /dev/null
+++ b/admin-cmds-device-migration.tex
@@ -0,0 +1,94 @@
+\subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device / Device groups / Group
+administration commands / Device Migration}
+
+In some systems, there is a need to migrate a running virtual machine
+from one system to another. A running virtual machine has one or more
+passthrough virtio member devices attached to it. A passthrough device
+is entirely operated by the guest virtual machine. For example, with
+the SR-IOV group type, a group member (VF) may undergo virtio device
+initialization and reset flow and may also undergo PCI function level
+reset (FLR) flow. Such flows must comply with the PCI standard and the
+virtio specification; at the same time such flows must not obstruct
+the device migration flow. In such a scenario, a group owner device
+can provide the administration command interface to facilitate the device
+migration related operations.
+
+When a virtual machine migrates from one hypervisor to another hypervisor,
+these hypervisors are referred to as the source and destination hypervisor
+respectively. In such a scenario, the source hypervisor administers the
+member device to suspend the device and preserve the device context.
+Subsequently, the destination hypervisor administers the member device to
+set up a device context and resume the member device. The source hypervisor
+reads the member device context and the destination hypervisor writes the member
+device context. The method to transfer the member device context from the source
+to the destination hypervisor is outside the scope of this specification.
+
+The member device can be in one of three migration modes. The owner driver
+sets the member device to one of the following modes during the device migration flow.
+
+\begin{tabularx}{\textwidth}{ |l||l|X| }
+\hline
+Value & Name & Description \\
+\hline \hline
+0x0 & Active &
+ It is the default mode after instantiation of the member device. \\
+\hline
+0x1 & Stop &
+ In this mode, the member device does not send any notifications,
+ and it does not access any driver memory.
+ The member device may receive driver notifications in this mode, and
+ the member device context and device configuration space may change. \\
+\hline
+0x2 & Freeze &
+ In this mode, the member device does not accept any driver notifications,
+ it ignores any device configuration space writes, and
+ the device context does not change. The
+ member device is not accessed in the system through the virtio interface. \\
+\hline
+\hline
+0x03-0xFF & - & reserved for future use \\
+\hline
+\end{tabularx}
+
+When the owner driver wants to stop the operation of the
+device, the owner driver sets the device mode to \field{Stop}. Once the
+device is in the \field{Stop} mode, the device does not initiate any notifications
+and does not access any driver memory. Since the member driver may still be
+active and may send further driver notifications to the device, the device
+context may be updated. When the member driver has stopped accessing the
+device, the owner driver sets the device to \field{Freeze} mode indicating
+to the device that no more driver access occurs. In the \field{Freeze} mode,
+no more changes occur in the device context. At this point, the device ensures
+that there will not be any update to the device context.
+
+The member device has a device context which the owner driver can either
+read or write. The member device context consist of any device specific
+data which is needed by the device to resume its operation when the device mode
+is changed from \field{Stop} to \field{Active} or from \field{Freeze}
+to \field{Active}.
+
+Once the device context is read, it is cleared from the device. Typically, on
+the source hypervisor, the owner driver reads the device context once when
+the device is in \field{Active} or \field{Stop} mode and later once the member
+device is in \field{Freeze} mode.
+
+Typically, the device context is read and written one time on the source and
+the destination hypervisor respectively once the device is in \field{Freeze}
+mode. On the destination hypervisor, after writing the device context,
+when the device mode is set to \field{Active}, the device uses the most recently
+set device context and resumes the device operation.
+
+In an alternative flow, on the source hypervisor the owner driver may choose
+to read the device context a first time while the device is in \field{Active} mode
+and a second time once the device is in \field{Freeze} mode. Similarly, on the
+destination hypervisor the owner driver writes the device context a first time
+while the device is still running in \field{Active} mode on the source hypervisor
+and writes the device context a second time while the device is in \field{Freeze} mode.
+This flow may result in a very short setup time as the device context likely
+has minimal changes from the previously written device context. This flow may
+reduce the device migration time significantly and may have near constant
+device activation time regardless of the number of virtqueues, resources and
+passthrough devices in use by the migrating virtual machine.
+
+The owner driver can discard any partially read or written device context when
+any of the device migration flow should be aborted.
diff --git a/admin.tex b/admin.tex
index 0803c26..6eeef58 100644
--- a/admin.tex
+++ b/admin.tex
@@ -297,6 +297,7 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
might differ between different group types.
\input{admin-cmds-legacy-interface.tex}
+\input{admin-cmds-device-migration.tex}
\devicenormative{\subsubsection}{Group administration commands}{Basic Facilities of a Virtio Device / Device groups / Group administration commands}
--
2.34.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [virtio-comment] [PATCH 2/8] admin: Redefine reserved2 as command specific output
2023-09-09 14:29 [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 1/8] admin: Add theory of operation for device migration Parav Pandit
@ 2023-09-09 14:29 ` Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 3/8] device-context: Define the device context fields for device migration Parav Pandit
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
Currently, when a command needs to return two distinct types of data in
the result, such as one consumed by the driver and another to be zero
copied to some user buffers, the driver needs to prepare an
extra descriptor for the driver consumed field. When such a field is
<= 4 bytes, the extra descriptor is an overhead.
struct virtio_admin_cmd already has 4 reserved bytes in the device
writable area. Redefine them as command specific device writable output.
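A minimal sketch of how a driver might consume the redefined field, assuming
a command that places a single little-endian 32-bit value in it (as the
context read command later in this series does with its length). The struct
name admin_cmd_dev_writable and the helper output_as_le32 are made up for
illustration; only the field names and sizes come from the diff below.

```c
#include <assert.h>
#include <stdint.h>

/* Device-writable part of the admin command after this patch.
 * Multi-byte fields are little-endian on the wire, so they are kept as
 * byte arrays here; the trailing command_specific_result[] is omitted. */
struct admin_cmd_dev_writable {
    uint8_t status[2];
    uint8_t status_qualifier[2];
    uint8_t command_specific_output[4];
};
static_assert(sizeof(struct admin_cmd_dev_writable) == 8,
              "fixed device-writable header is 8 bytes");

/* Decode the 4-byte output field as one little-endian 32-bit value. */
uint32_t output_as_le32(const struct admin_cmd_dev_writable *w)
{
    const uint8_t *p = w->command_specific_output;

    return (uint32_t)p[0] |
           (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 |
           (uint32_t)p[3] << 24;
}
```

Because the value sits in the fixed header, the driver can read it without
posting a separate descriptor ahead of the zero-copied result buffer.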
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
admin.tex | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/admin.tex b/admin.tex
index 6eeef58..c86813d 100644
--- a/admin.tex
+++ b/admin.tex
@@ -90,8 +90,7 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
/* Device-writable part */
le16 status;
le16 status_qualifier;
- /* unused, reserved for future extensions */
- u8 reserved2[4];
+ u8 command_specific_output[4];
u8 command_specific_result[];
};
\end{lstlisting}
@@ -192,11 +191,15 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
\hline
\end{tabularx}
-Each command uses a different \field{command_specific_data} and
-\field{command_specific_result} structures and the length of
+Each command uses a different \field{command_specific_data},
+\field{command_specific_output} and
+\field{command_specific_result} fields. The length of
\field{command_specific_data} and \field{command_specific_result}
-depends on these structures and is described separately or is
-implicit in the structure description.
+depends on the command and is described separately or is
+implicit in the structure description. The \field{command_specific_output}
+carries any command specific output of up to 4 bytes in size. The
+\field{command_specific_output} contains one or more command specific
+fields.
Before sending any group administration commands to the device, the driver
needs to communicate to the device which commands it is going to
--
2.34.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [virtio-comment] [PATCH 3/8] device-context: Define the device context fields for device migration
2023-09-09 14:29 [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 1/8] admin: Add theory of operation for device migration Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 2/8] admin: Redefine reserved2 as command specific output Parav Pandit
@ 2023-09-09 14:29 ` Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 4/8] admin: Add device migration admin commands Parav Pandit
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
Define the device context and its fields for the purpose of device
migration. The device context is read and written by the owner driver
on the source and destination hypervisor respectively.
The set of device context fields is expected to grow rapidly after this
initial version to cover many more details of the device.
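The type/length/value layout defined in the diff below lends itself to a
simple walker. The following C sketch assumes the fields are packed back to
back with no padding (a le32 field_count, then 16 byte TLV headers each
followed by length bytes of value); the patch does not state alignment rules,
so that packing is an assumption, and dev_ctx_walk and tlv_cb are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLV_HDR_LEN 16u  /* le32 type + le32 reserved + le64 length */
static_assert(TLV_HDR_LEN == 4 + 4 + 8, "header must match the TLV struct");

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t rd_le64(const uint8_t *p)
{
    return (uint64_t)rd_le32(p) | (uint64_t)rd_le32(p + 4) << 32;
}

/* Called once per field; value points into the caller's buffer. */
typedef void (*tlv_cb)(uint32_t type, const uint8_t *value,
                       uint64_t length, void *opaque);

/* Walk a device context buffer. Returns the number of fields visited,
 * or -1 if the buffer is truncated. cb may be NULL to only validate. */
int dev_ctx_walk(const uint8_t *buf, size_t len, tlv_cb cb, void *opaque)
{
    if (len < 4)
        return -1;
    uint32_t count = rd_le32(buf);
    size_t off = 4;

    for (uint32_t i = 0; i < count; i++) {
        if (len - off < TLV_HDR_LEN)
            return -1;
        uint32_t type = rd_le32(buf + off);
        uint64_t vlen = rd_le64(buf + off + 8);
        off += TLV_HDR_LEN;
        if (vlen > len - off)
            return -1;
        if (cb)
            cb(type, buf + off, vlen, opaque);
        off += (size_t)vlen;
    }
    return (int)count;
}
```

Bounds are checked before every read so that a truncated or corrupted
context received from the source cannot push the walker past the buffer.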
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
content.tex | 1 +
device-context.tex | 107 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 108 insertions(+)
create mode 100644 device-context.tex
diff --git a/content.tex b/content.tex
index 0a62dce..2698931 100644
--- a/content.tex
+++ b/content.tex
@@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
\input{admin.tex}
+\input{device-context.tex}
\chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
diff --git a/device-context.tex b/device-context.tex
new file mode 100644
index 0000000..656eea4
--- /dev/null
+++ b/device-context.tex
@@ -0,0 +1,107 @@
+\section{Device Context}\label{sec:Basic Facilities of a Virtio Device / Device Context}
+
+The device context holds the information that an owner driver can use
+to set up a member device and resume its operation. The device context
+of a member device is read or written by the owner driver using
+administration commands.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_field_tlv {
+ le32 type;
+ le32 reserved;
+ le64 length;
+ u8 value[];
+};
+
+struct virtio_dev_ctx {
+ le32 field_count;
+ struct virtio_dev_ctx_field_tlv fields[];
+};
+
+\end{lstlisting}
+
+The \field{struct virtio_dev_ctx} is the device context of a member device.
+The \field{field_count} indicates how many instances of
+\field{struct virtio_dev_ctx_field_tlv} are present.
+
+The \field{struct virtio_dev_ctx_field_tlv} consists of \field{type} indicating
+what data is contained in the \field{value} of length \field{length}.
+The valid values for \field{type} can be found in the following table:
+
+\begin{tabularx}{\textwidth}{ |l||l|X| }
+\hline
+type & Name & Description \\
+\hline \hline
+0x0 & VIRTIO_DEV_CTX_PCI_COMMON_CFG & Provides common configuration space of device for PCI transport \\
+\hline
+0x1 & VIRTIO_DEV_CTX_PCI_VQ_CFG & Provides Virtqueue configuration for PCI transport \\
+\hline
+0x2 & VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG & Provides Queue run time state \\
+\hline
+0x3 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of virtqueue descriptors owned by device \\
+\hline
+0x4 - 0xFFFFFFFF & - & Reserved for future types \\
+\hline
+\end{tabularx}
+
+\subsubsection{Device Context Fields}\label{sec:Basic Facilities of a Virtio Device / Device Context / Device Context Fields}
+
+\paragraph{PCI Common Configuration Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Common Configuration Context}
+
+For the field VIRTIO_DEV_CTX_PCI_COMMON_CFG, \field{type} is set to 0x0.
+The \field{value} is in format of \field{struct virtio_pci_common_cfg}.
+The \field{length} is the length of \field{struct virtio_pci_common_cfg}.
+
+\paragraph{PCI Virtqueue Configuration Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Virtqueue Configuration Context}
+
+For the field VIRTIO_DEV_CTX_PCI_VQ_CFG, \field{type} is set to 0x1.
+The \field{value} is in format of \field{struct virtio_dev_ctx_pci_vq_cfg}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_pci_vq_cfg}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_pci_vq_cfg {
+ le16 vq_index;
+ le16 queue_size;
+ le16 queue_msix_vector;
+ le64 queue_desc;
+ le64 queue_driver;
+ le64 queue_device;
+};
+\end{lstlisting}
+
+\paragraph{Virtqueue Split Mode Runtime Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Virtqueue Split Mode Runtime Context}
+
+For the field VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG, \field{type} is set to 0x2.
+The \field{value} is in format of \field{struct virtio_dev_ctx_vq_split_runtime}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_runtime}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_vq_split_runtime {
+ le16 vq_index;
+ le16 dev_avail_idx;
+ u8 enabled;
+};
+\end{lstlisting}
+
+The \field{dev_avail_idx} indicates the next available index of the virtqueue from which
+the device must start processing the available ring.
+
+\paragraph{Virtqueue Split Mode Device owned Descriptors}
+
+For the field VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC, \field{type} is set to 0x3.
+The \field{value} is in format of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_vq_split_dev_descs {
+ le16 vq_index;
+ le16 desc_count;
+ le16 desc_idx[];
+};
+\end{lstlisting}
+
+The \field{desc_idx} array contains \field{desc_count} indices of the descriptors of the
+virtqueue identified by \field{vq_index} that are owned by the device.
--
2.34.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [virtio-comment] [PATCH 4/8] admin: Add device migration admin commands
2023-09-09 14:29 [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands Parav Pandit
` (2 preceding siblings ...)
2023-09-09 14:29 ` [virtio-comment] [PATCH 3/8] device-context: Define the device context fields for device migration Parav Pandit
@ 2023-09-09 14:29 ` Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 5/8] admin: Add requirements of device migration commands Parav Pandit
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
A passthrough device is mapped to the guest VM. A passthrough device
accessed by the driver can undergo its own device reset and, for the PCI
transport, it can undergo its PCI FLR while the guest VM migration is
ongoing.
The passthrough device may not have any direct channel through which
device migration related administrative tasks can be done, and even if
it has one, such administrative tasks must not be interrupted by the
device reset or VF FLR flow initiated by the passthrough device.
Hence, the owner driver, which administers the member devices,
facilitates the device migration flow.
Add device migration administration commands that the owner driver can use
for the passthrough device.
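The chunked read described in the diff below (keep issuing
VIRTIO_ADMIN_CMD_DEV_CTX_READ until the returned length is zero or shorter
than requested) can be sketched as follows. The opcode values are the ones
this patch assigns; ctx_read_fn, dev_ctx_read_all and the mock device are
hypothetical and only model one command round trip as a function call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opcodes as assigned by this patch. */
enum {
    VIRTIO_ADMIN_CMD_DEV_MODE_GET     = 0x7,
    VIRTIO_ADMIN_CMD_DEV_MODE_SET     = 0x8,
    VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET = 0x9,
    VIRTIO_ADMIN_CMD_DEV_CTX_READ     = 0xa,
    VIRTIO_ADMIN_CMD_DEV_CTX_WRITE    = 0xb,
};

/* Hypothetical hook: issues one DEV_CTX_READ for up to len bytes and
 * returns the context_len the device reported for that command. */
typedef uint32_t (*ctx_read_fn)(void *dev, uint8_t *buf, uint32_t len);

/* Read the context stream in chunks. Returns the total bytes stored in
 * out; stops at the end of the stream or when out is full. */
size_t dev_ctx_read_all(void *dev, ctx_read_fn read_one,
                        uint8_t *out, size_t out_len, uint32_t chunk)
{
    size_t total = 0;

    while (total < out_len) {
        uint32_t want = chunk;
        if (out_len - total < want)
            want = (uint32_t)(out_len - total);
        uint32_t got = read_one(dev, out + total, want);
        assert(got <= want);
        total += got;
        if (got == 0 || got < want)
            break;  /* zero or short read: the context stream has ended */
    }
    return total;
}

/* Mock device used only to exercise the loop. */
struct mock_ctx {
    const uint8_t *data;
    size_t len;
    size_t pos;
    unsigned calls;
};

uint32_t mock_read(void *dev, uint8_t *buf, uint32_t len)
{
    struct mock_ctx *m = dev;
    size_t left = m->len - m->pos;
    uint32_t n = left < len ? (uint32_t)left : len;

    memcpy(buf, m->data + m->pos, n);
    m->pos += n;
    m->calls++;
    return n;
}
```

With a 10 byte context and 4 byte chunks the loop issues three reads (4, 4
and 2 bytes); the short third read is what signals the end of the stream.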
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
admin-cmds-device-migration.tex | 201 +++++++++++++++++++++++++++++++-
1 file changed, 200 insertions(+), 1 deletion(-)
diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index f839af4..b7bfc09 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -65,7 +65,8 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
read or write. The member device context consist of any device specific
data which is needed by the device to resume its operation when the device mode
is changed from \field{Stop} to \field{Active} or from \field{Freeze}
-to \field{Active}.
+to \field{Active}. The device context is described in section
+\ref{sec:Basic Facilities of a Virtio Device / Device Context}.
Once the device context is read, it is cleared from the device. Typically, on
the source hypervisor, the owner driver reads the device context once when
@@ -92,3 +93,201 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
The owner driver can discard any partially read or written device context when
any of the device migration flow should be aborted.
+
+The owner driver uses the following device migration group administration commands.
+
+\begin{enumerate}
+\item Device Mode Get Command
+\item Device Mode Set Command
+\item Device Context Size Get Command
+\item Device Context Read Command
+\item Device Context Write Command
+\item Device Context Discard Command
+\end{enumerate}
+
+These commands are currently only defined for the SR-IOV group type.
+
+\paragraph{Device Mode Get Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Mode Get Command}
+
+This command reads the mode of the device.
+For the command VIRTIO_ADMIN_CMD_DEV_MODE_GET, \field{opcode}
+is set to 0x7.
+The \field{group_member_id} refers to the member device to be accessed.
+This command has no command specific data.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_mode_get_result {
+ u8 mode;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_mode_get_result}
+returned by the device, in which the device sets \field{mode} to one of
+\field{Active}, \field{Stop} or \field{Freeze}.
+
+\paragraph{Device Mode Set Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Mode Set Command}
+
+This command sets the mode of the device.
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_mode_set_data} describing the new device mode.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_mode_set_data {
+ u8 mode;
+};
+\end{lstlisting}
+
+For the command VIRTIO_ADMIN_CMD_DEV_MODE_SET, \field{opcode} is set to 0x8.
+The \field{group_member_id} refers to the member device to be accessed.
+The \field{mode} is set to one of \field{Active}, \field{Stop} or
+\field{Freeze}.
+
+This command has no command specific result. When the command completes
+successfully, the device is set to the new \field{mode}. When the command fails,
+the device stays in the previous mode.
+
+\paragraph{Device Context Size Get Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Size Get Command}
+
+This command returns the remaining estimated device context size. The
+driver can query the remaining estimated device context size
+for the current mode or for the \field{Freeze} mode. While
+reading the device context using the VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the
+actual device context size may differ from what is returned by
+this command. After reading the device context using the command
+VIRTIO_ADMIN_CMD_DEV_CTX_READ, the remaining estimated context size
+usually reduces by the amount of device context read by the driver using
+the VIRTIO_ADMIN_CMD_DEV_CTX_READ command. If the device context is updated
+rapidly, the remaining estimated context size may also increase even after
+reading the device context using the VIRTIO_ADMIN_CMD_DEV_CTX_READ command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET, \field{opcode} is set to 0x9.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_size_get_data {
+ u8 freeze_mode;
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_ctx_size_get_data}.
+When \field{freeze_mode} is set to 1, the device returns the estimated
+device context size for when the device will be in \field{Freeze} mode.
+As the device context is read from the device, the remaining estimated
+context size may decrease. For example, when the member device mode is
+\field{Stop} and the device has an estimated total device context size
+of 12KB, the device returns 12KB for the first
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command. Once the driver has
+read 8KB of device context data using the
+VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the remaining data is
+4KB, hence the device returns 4KB in the subsequent
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_size_get_result {
+ le64 size;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result} is in
+the format \field{struct virtio_admin_cmd_dev_ctx_size_get_result}.
+
+Once the device context is fully read, this command returns zero for
+\field{size} until the new device context is generated.
+
+\paragraph{Device Context Read Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Read Command}
+
+This command reads the current device context.
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_READ, \field{opcode} is set to 0xa.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command has no command specific data.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_rd_len {
+ le32 context_len;
+};
+
+struct virtio_admin_cmd_dev_ctx_rd_result {
+ u8 data[];
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_ctx_rd_result}
+returned by the device containing the device context data and
+\field{command_specific_output} is in format of
+\field{struct virtio_admin_cmd_dev_ctx_rd_len} containing length of
+context data returned by the device in the command response. When the length
+returned is zero or when the returned context data is less the data requested by
+the driver, the device do not have any device context data left that the device
+can report, at this point the device context stream ends.
+
+The driver can read the whole device context data using one or multiple
+commands. When the device context does not fit in the
+\field{command_specific_result}, the driver reads the remaining
+bytes using one or more subsequent commands.
+
+\paragraph{Device Context Write Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Write Command}
+
+This command writes the device context data. The device context can be written
+only when the device mode is \field{Freeze}.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, \field{opcode}
+is set to 0xb.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_wr_data {
+ u8 data[];
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_ctx_wr_data} containing
+the device context data to be written.
+
+This command has no command specific result.
+The device fails the command when it is executed while the device mode
+is other than \field{Freeze}.
+
+The written device context is effective when the device mode is changed
+from \field{Freeze} to \field{Stop} or from \field{Freeze} to \field{Active}.
+
+The driver can write the whole device context using one or multiple
+commands. When the device context does not fit in the
+\field{command_specific_data} of one command, the driver writes the
+remaining bytes using one or more subsequent commands.
+
+\paragraph{Device Context Discard Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Discard Command}
+
+This command discards any partial device context that is yet to be read
+by the driver and it also discards any device context that is partially written.
+This command can be used by the driver to abort any device context migration
+flow in which partial context read or write operations may have
+occurred.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD, \field{opcode}
+is set to 0xc.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command has no command specific data.
+This command has no command specific result.
+
+Once this command completes successfully, the device context is
+discarded. If the device context that is discarded was part of the write
+operation, once this command completes, the device functions as if the device
+context was never written. If the device context that is discarded was part
+of the read operation, once this command completes, the device functions as if
+the device context was never read in the given device mode. Once the device
+context is discarded, in subsequent VIRTIO_ADMIN_CMD_DEV_CTX_READ command,
+the device returns new device context entry. Once the device context is
+discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
+context.
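For illustration only (not part of the proposed spec text), the size query and read flow described in this patch could be sketched in C as follows. The mock device, the helper names (mock_ctx_size_get, mock_ctx_read, read_whole_ctx) and the chunked buffer handling are assumptions made for the example; a real owner driver would issue VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET and VIRTIO_ADMIN_CMD_DEV_CTX_READ through the admin virtqueue instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a member device's context stream. */
struct mock_dev_ctx {
    const uint8_t *ctx;   /* full device context */
    size_t total;         /* total context size in bytes */
    size_t offset;        /* bytes already returned to the driver */
};

/* Models VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET: reports the remaining size. */
static uint64_t mock_ctx_size_get(const struct mock_dev_ctx *d)
{
    return d->total - d->offset;
}

/* Models VIRTIO_ADMIN_CMD_DEV_CTX_READ: copies up to 'want' bytes and
 * returns the length the device would report in context_len. */
static uint32_t mock_ctx_read(struct mock_dev_ctx *d, uint8_t *buf,
                              uint32_t want)
{
    size_t left = d->total - d->offset;
    uint32_t n = left < want ? (uint32_t)left : want;

    memcpy(buf, d->ctx + d->offset, n);
    d->offset += n;
    return n;
}

/* Driver-side loop: read in chunks until the stream ends, i.e. until the
 * returned length is zero or shorter than requested, or 'out' is full.
 * Returns the number of bytes stored in 'out'. */
static size_t read_whole_ctx(struct mock_dev_ctx *d, uint8_t *out,
                             size_t out_cap, uint32_t chunk)
{
    size_t done = 0;

    assert(chunk > 0);
    for (;;) {
        uint32_t want = chunk;
        uint32_t got;

        if (out_cap - done < want)
            want = (uint32_t)(out_cap - done);
        if (want == 0)
            break;              /* caller's buffer is full */
        got = mock_ctx_read(d, out + done, want);
        done += got;
        if (got < want)
            break;              /* short or zero read: end of stream */
    }
    return done;
}
```

With a 12KB mock context, reading 8KB first leaves mock_ctx_size_get() reporting 4KB, which matches the example in the size query paragraph; read_whole_ctx() then drains the rest and stops on the first zero-length read.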
--
2.34.1
This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.
In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.
Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/
* [virtio-comment] [PATCH 5/8] admin: Add requirements of device migration commands
2023-09-09 14:29 [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands Parav Pandit
` (3 preceding siblings ...)
2023-09-09 14:29 ` [virtio-comment] [PATCH 4/8] admin: Add device migration admin commands Parav Pandit
@ 2023-09-09 14:29 ` Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 6/8] admin: Add theory of operation for write recording commands Parav Pandit
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
Add device and driver side requirements for the device migration
commands.
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
admin-cmds-device-migration.tex | 102 ++++++++++++++++++++++++++++++++
admin.tex | 14 ++++-
2 files changed, 115 insertions(+), 1 deletion(-)
diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index b7bfc09..88e1af9 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -291,3 +291,105 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
the device returns new device context entry. Once the device context is
discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
context.
+
+\devicenormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
+
+A device MUST either support all of, or none of
+VIRTIO_ADMIN_CMD_DEV_MODE_GET,
+VIRTIO_ADMIN_CMD_DEV_MODE_SET,
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET,
+VIRTIO_ADMIN_CMD_DEV_CTX_READ,
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE and
+VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD commands.
+
+When the device \field{mode} supplied in the command
+VIRTIO_ADMIN_CMD_DEV_MODE_SET is the same as the current mode of the device, the device
+MUST complete the command successfully.
+
+The device MUST fail the command VIRTIO_ADMIN_CMD_DEV_MODE_SET when the \field{mode}
+is other than \field{Active} or \field{Stop} or \field{Freeze}.
+
+When changing the device mode using the command VIRTIO_ADMIN_CMD_DEV_MODE_SET,
+if the command fails, the device MUST retain the current device mode.
+
+The device MUST fail VIRTIO_ADMIN_CMD_DEV_MODE_SET command when \field{mode}
+is set to \field{Active} or \field{Stop} if the device context is
+partially read or written using VIRTIO_ADMIN_CMD_DEV_CTX_READ or
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE commands respectively.
+
+When VIRTIO_ADMIN_CMD_DEV_CTX_READ command is received multiple times
+in a given mode, and when the complete device context is already read by the
+driver, on subsequent reception of command VIRTIO_ADMIN_CMD_DEV_CTX_READ,
+the device MUST complete the command successfully with
+\field{context_len} set to zero.
+
+The device MUST support reading the device context when the device is
+in any mode \field{Active} or \field{Stop} or \field{Freeze} using command
+VIRTIO_ADMIN_CMD_DEV_CTX_READ.
+
+When the device is in any of the modes and the device context is read
+partially using VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the device MUST discard
+the device context when VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD command is executed;
+in subsequent execution of VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET and
+VIRTIO_ADMIN_CMD_DEV_CTX_READ, the device MUST return the remaining
+estimated device context size and the device context respectively for the
+current mode as if VIRTIO_ADMIN_CMD_DEV_CTX_READ was never received by the
+device for the current device mode.
+
+The device MUST support writing the complete device context multiple times
+by the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE.
+
+The device MUST fail VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command when the device
+mode is not \field{Freeze}.
+
+When the device is in \field{Freeze} mode, and if any device context is
+written partially by VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, the device MUST discard
+the device context when VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD
+command is executed, i.e. the device functions as if the command
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE was never received.
+
+For the SR-IOV group type, when the device context is read using
+VIRTIO_ADMIN_CMD_DEV_CTX_READ from one device and written to another device
+using VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, the driver MUST read and write
+device context only if the device PCI subsystem vendor id and device id
+match for both the devices.
+
+For the SR-IOV group type, a function level reset (FLR) operation MUST set the
+device mode to \field{Active}.
+
+For the SR-IOV group type, when the device is in \field{Freeze} mode, any
+write access to configuration space MUST NOT update any fields and any
+configuration space read MAY return any value.
+
+For the SR-IOV group type, regardless of the member device \field{mode}, all
+the PCI transport level registers MUST always be accessible and the member device
+MUST function the same way for all the PCI transport level registers
+regardless of the member device mode.
+
+For the SR-IOV group type, for the VIRTIO_PCI_CAP_PCI_CFG capability area,
+the device MUST ignore writes when the device mode is set to \field{Freeze}
+and on receiving the reads, the device MUST function same regardless of the
+device mode is \field{Active} or \field{Stop} or \field{Freeze}.
+
+\drivernormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
+
+The driver SHOULD read the complete device context using one or multiple
+VIRTIO_ADMIN_CMD_DEV_CTX_READ commands.
+
+The driver MAY write the device context before changing the device mode from
+\field{Freeze} to \field{Stop} or from \field{Freeze} to \field{Active};
+the driver MUST write a complete device context using one or multiple
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE commands.
+
+The driver MUST NOT change the device mode to \field{Stop} or \field{Active}
+in the command VIRTIO_ADMIN_CMD_DEV_MODE_SET when device context is
+partially written.
+
+For the SR-IOV group type, the driver SHOULD NOT access device configuration
+space described in section
+\ref{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
+when the device mode is set to \field{Freeze}.
+
+For the SR-IOV group type, the driver MUST NOT write into the
+VIRTIO_PCI_CAP_PCI_CFG capability area when the device mode is set to
+\field{Freeze}.
diff --git a/admin.tex b/admin.tex
index c86813d..3429c4e 100644
--- a/admin.tex
+++ b/admin.tex
@@ -126,7 +126,19 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
\hline
0x0006 & VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO & Query the notification region information \\
\hline
-0x0007 - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd} \\
+0x0007 & VIRTIO_ADMIN_CMD_DEV_MODE_GET & Query the device mode \\
+\hline
+0x0008 & VIRTIO_ADMIN_CMD_DEV_MODE_SET & Set the device mode \\
+\hline
+0x0009 & VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET & Query the device context size \\
+\hline
+0x000a & VIRTIO_ADMIN_CMD_DEV_CTX_READ & Read the device context data \\
+\hline
+0x000b & VIRTIO_ADMIN_CMD_DEV_CTX_WRITE & Write the device context data \\
+\hline
+0x000c & VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD & Clear the device context data \\
+\hline
+0x000d - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd} \\
\hline
0x8000 - 0xFFFF & - & Reserved for future commands (possibly using a different structure) \\
\hline
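For illustration only (not part of the proposed spec text), the device-side mode checks required by this patch could be modelled in C roughly as below. The numeric mode values, the struct member_dev layout and the helper names are placeholders invented for the sketch; the actual mode encoding is defined by the command patch in this series. The sketch also assumes that the "same mode succeeds" rule is evaluated before the "partial context" rule, which the requirements do not order explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values; the real encoding is defined by the spec text. */
enum dev_mode { MODE_ACTIVE = 0, MODE_STOP = 1, MODE_FREEZE = 2 };

/* Hypothetical device-side state for one member device. */
struct member_dev {
    int mode;
    bool ctx_partially_read;
    bool ctx_partially_written;
};

/* Models VIRTIO_ADMIN_CMD_DEV_MODE_SET. Returns 0 on success, -1 on
 * failure; on failure the current mode is retained. */
static int dev_mode_set(struct member_dev *d, int new_mode)
{
    assert(d != NULL);

    /* Unknown modes are rejected. */
    if (new_mode != MODE_ACTIVE && new_mode != MODE_STOP &&
        new_mode != MODE_FREEZE)
        return -1;

    /* Setting the mode the device is already in succeeds (assumed to be
     * checked first). */
    if (new_mode == d->mode)
        return 0;

    /* Leaving for Active or Stop is refused while a context read or
     * write is only partially done. */
    if ((new_mode == MODE_ACTIVE || new_mode == MODE_STOP) &&
        (d->ctx_partially_read || d->ctx_partially_written))
        return -1;

    d->mode = new_mode;
    return 0;
}

/* Models the mode check of VIRTIO_ADMIN_CMD_DEV_CTX_WRITE: the command is
 * accepted only in Freeze mode. */
static int dev_ctx_write_check(const struct member_dev *d)
{
    assert(d != NULL);
    return d->mode == MODE_FREEZE ? 0 : -1;
}
```

A device in Freeze mode with a half-written context would therefore reject a switch to Stop until the driver either completes the write or issues VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD.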
--
2.34.1
* [virtio-comment] [PATCH 6/8] admin: Add theory of operation for write recording commands
2023-09-09 14:29 [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands Parav Pandit
` (4 preceding siblings ...)
2023-09-09 14:29 ` [virtio-comment] [PATCH 5/8] admin: Add requirements of device migration commands Parav Pandit
@ 2023-09-09 14:29 ` Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 7/8] admin: Add " Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 8/8] admin: Add requirements of write reporting commands Parav Pandit
7 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
During a device migration flow (typically in a precopy phase of the
live migration), a device may write to the guest memory. Some
IOMMUs/hypervisors may not be able to track these written pages.
These pages need to be migrated from the source to the destination hypervisor.
A device which writes to these pages provides the record of written page
addresses to the owner device. The owner driver starts write
recording for the device and queries all the page addresses written by
the device.
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
admin-cmds-device-migration.tex | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index 88e1af9..e98d552 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -94,6 +94,21 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
The owner driver can discard any partially read or written device context when
any of the device migration flow should be aborted.
+During the device migration flow, a passthrough device may write data to the
+guest virtual machine memory, a source hypervisor needs to keep track of these
+written memory to migrate such memory to destination hypervisor.
+Some systems may not be able to keep track of such memory write addresses at
+hypervisor level. In such a scenario, a device records and reports these
+written memory addresses to the owner device. Such an address is named as
+IO virtual address (IOVA). The owner driver enables write recording for one or
+more IOVA ranges per device during device migration flow. The owner driver
+periodically queries these written IOVA records from the device. As the driver
+reads the written IOVA records, the device clears those records from the device.
+Once the device reports zero or small number of written IOVA records, the device
+mode is set to \field{Stop} or \field{Freeze}. Once the device is set to \field{Stop}
+or \field{Freeze} mode, and once all the IOVA records are read, the driver stops
+the write recording in the device.
+
The owner driver uses following device migration group administration commands.
\begin{enumerate}
--
2.34.1
* [virtio-comment] [PATCH 7/8] admin: Add write recording commands
2023-09-09 14:29 [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands Parav Pandit
` (5 preceding siblings ...)
2023-09-09 14:29 ` [virtio-comment] [PATCH 6/8] admin: Add theory of operation for write recording commands Parav Pandit
@ 2023-09-09 14:29 ` Parav Pandit
2023-09-09 14:29 ` [virtio-comment] [PATCH 8/8] admin: Add requirements of write reporting commands Parav Pandit
7 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
When migrating a virtual machine with passthrough
virtio devices, the virtio device may write into the guest
memory. Some systems may not be able to keep track of these
pages efficiently.
To facilitate such a system, a device provides the record
of pages which are written by the device. These commands
connect to the vfio framework at [1].
The owner driver configures the member device with a list of address
ranges for which it expects write recording and reporting by the device.
The owner driver periodically queries the written page address records,
which get cleared from the device upon reading.
When the number of write records reduces over time, at some point write recording
is stopped after the device mode is set to FREEZE.
[1] https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1207
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
admin-cmds-device-migration.tex | 146 ++++++++++++++++++++++++++++++--
admin.tex | 10 ++-
2 files changed, 146 insertions(+), 10 deletions(-)
diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index e98d552..49835eb 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -97,15 +97,16 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
During the device migration flow, a passthrough device may write data to the
guest virtual machine memory, a source hypervisor needs to keep track of these
written memory to migrate such memory to destination hypervisor.
-Some systems may not be able to keep track of such memory write addresses at
-hypervisor level. In such a scenario, a device records and reports these
-written memory addresses to the owner device. Such an address is named as
-IO virtual address (IOVA). The owner driver enables write recording for one or
-more IOVA ranges per device during device migration flow. The owner driver
-periodically queries these written IOVA records from the device. As the driver
-reads the written IOVA records, the device clears those records from the device.
-Once the device reports zero or small number of written IOVA records, the device
-mode is set to \field{Stop} or \field{Freeze}. Once the device is set to \field{Stop}
+Some systems may not be able to keep track of such
+memory write addresses at the hypervisor level. In such a scenario, a device
+records and reports these written memory addresses to the owner device. Such an
+address is named as IO virtual address (IOVA). The owner driver enables write
+recording for one or more IOVA ranges per device during device migration
+flow. The owner driver periodically queries these written IOVA records from
+the device. As the driver reads the written IOVA records,
+the device clears those records from the device. Once the device reports
+zero or small number of written IOVA records, the device is set to
+\field{Stop} or \field{Freeze} mode. Once the device is set to \field{Stop}
or \field{Freeze} mode, and once all the IOVA records are read, the driver stops
the write recording in the device.
@@ -118,6 +119,10 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
\item Device Context Read Command
\item Device Context Write Command
\item Device Context Discard Command
+\item Device Write Record Capabilities Query Command
+\item Device Write Records Start Command
+\item Device Write Records Stop Command
+\item Device Write Records Read Command
\end{enumerate}
These commands are currently only defined for the SR-IOV group type.
@@ -307,6 +312,129 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
context.
+\paragraph{Device Write Record Capabilities Query Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Record Capabilities Query Command}
+
+This command reads the device write record capabilities.
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY, \field{opcode}
+is set to 0xd.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_write_record_cap_result {
+ le32 supported_iova_page_size_bitmap;
+ le32 supported_iova_ranges;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_write_record_cap_result}
+returned by the device. The \field{supported_iova_page_size_bitmap} indicates
+the granularity at which the device can record IOVA ranges. The smallest
+granularity is 4KB. Bit 0 corresponds to 4KB, bit 1 corresponds to 8KB, and so on
+in powers of two up to bit 31, which corresponds to 8TB.
+The device supports one or more IOVA page granularities; for each supported
+granularity, the device sets the corresponding bit in the
+\field{supported_iova_page_size_bitmap}. The \field{supported_iova_ranges}
+indicates how many unique (non overlapping) IOVA ranges can be recorded by
+the device.
+
+\paragraph{Device Write Records Start Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Start Command}
+
+This command starts the write recording in the device for the specified IOVA
+ranges.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START, \field{opcode}
+is set to 0xe.
+The \field{group_member_id} refers to the member device to be accessed.
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_write_record_start_data}.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_write_record_start_entry {
+ le64 iova;
+ le64 page_count;
+};
+
+struct virtio_admin_cmd_write_record_start_data {
+ le64 page_size;
+ le32 count;
+ u8 reserved[4];
+ struct virtio_admin_cmd_write_record_start_entry entries[];
+};
+
+\end{lstlisting}
+
+The \field{count} is set to indicate the number of valid \field{entries}.
+The \field{iova} indicates the start IOVA address. The \field{page_count}
+indicates the number of pages of size \field{page_size} starting from \field{iova}
+to record for write reporting. The VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
+command contains unique, i.e. non overlapping, IOVA range entries.
+Whenever a memory write occurs by the device in the supplied IOVA range, the
+device records the actual IOVA and number of bytes written to the IOVA.
+These write records can be read by
+the driver using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
+
+This command has no command specific result.
+
+\paragraph{Device Write Records Stop Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Stop Command}
+
+This command stops the write recording in the device for IOVA ranges
+which were previously started using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
+command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP, \field{opcode}
+is set to 0xf.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command does not have any command specific data.
+This command has no command specific result.
+
+\paragraph{Device Write Records Read Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Read Command}
+
+This command reads the device write records for which the write recording is
+previously started using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ, \field{opcode}
+is set to 0x10.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_write_records_read_data {
+ le64 iova;
+ le64 length;
+};
+
+struct virtio_admin_cmd_dev_write_records_cnt {
+ le32 count;
+};
+
+struct virtio_admin_cmd_dev_write_records_result {
+ le64 iova_entries[];
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_write_records_read_data}. The driver
+sets the \field{iova} indicating the start IOVA address for up to the
+\field{length} number of bytes. The supplied IOVA range is the same as or smaller
+than the range supplied when write recording is started by the driver
+in VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_write_records_result}
+and \field{command_specific_output} is in the format
+\field{struct virtio_admin_cmd_dev_write_records_cnt} containing the number
+of write records returned by the device. When the command completes
+successfully, the write records which are returned in the result are
+cleared from the device and the same records cannot be read again. When new
+writes occur in the same IOVA range or in a different one, those records can be read
+as new write records.
+
\devicenormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
A device MUST either support all of, or none of
diff --git a/admin.tex b/admin.tex
index 3429c4e..cffd85e 100644
--- a/admin.tex
+++ b/admin.tex
@@ -138,7 +138,15 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
\hline
0x000c & VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD & Clear the device context data \\
\hline
-0x000d - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd} \\
+0x000d & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY & Query Write recording capabilities \\
+\hline
+0x000e & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START & Start Write recording in the device \\
+\hline
+0x000f & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP & Stop all write recording in the device \\
+\hline
+0x0010 & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ & Read and clear write records from the device \\
+\hline
+0x0011 - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd} \\
\hline
0x8000 - 0xFFFF & - & Reserved for future commands (possibly using a different structure) \\
\hline
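For illustration only (not part of the proposed spec text), decoding \field{supported_iova_page_size_bitmap} could look like the C sketch below. It follows the stated rule that bit 0 is 4KB and bit 1 is 8KB, so bit n is taken to mean a page size of 4KB shifted left by n. The helper names and the selection policy (smallest supported size that is at least a requested minimum) are assumptions for the example, and the bitmap is assumed to be already converted from little-endian to CPU byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 0 of the bitmap corresponds to 4KB, i.e. a shift of 12. */
#define IOVA_PAGE_SHIFT_BASE 12u

/* Page size in bytes for a given bitmap bit position: 4KB << bit. */
static uint64_t iova_page_size_for_bit(unsigned bit)
{
    assert(bit < 32);
    return (uint64_t)1 << (IOVA_PAGE_SHIFT_BASE + bit);
}

/* Returns the smallest page size advertised in 'bitmap' that is at least
 * 'min_size' bytes, or 0 when the device advertises no such size. */
static uint64_t pick_page_size(uint32_t bitmap, uint64_t min_size)
{
    for (unsigned bit = 0; bit < 32; bit++) {
        uint64_t sz = iova_page_size_for_bit(bit);

        if ((bitmap & ((uint32_t)1 << bit)) && sz >= min_size)
            return sz;
    }
    return 0;
}
```

A driver could pass the chosen value as \field{page_size} in VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START; a return of 0 means no advertised granularity satisfies the request.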
--
2.34.1
* [virtio-comment] [PATCH 8/8] admin: Add requirements of write reporting commands
2023-09-09 14:29 [virtio-comment] [PATCH 0/8] Introduce device migration supporting commands Parav Pandit
` (6 preceding siblings ...)
2023-09-09 14:29 ` [virtio-comment] [PATCH 7/8] admin: Add " Parav Pandit
@ 2023-09-09 14:29 ` Parav Pandit
7 siblings, 0 replies; 9+ messages in thread
From: Parav Pandit @ 2023-09-09 14:29 UTC (permalink / raw)
To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit
Add device and driver requirements for the write reporting commands.
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
admin-cmds-device-migration.tex | 36 +++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index 49835eb..09e772a 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -514,6 +514,34 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
and on receiving the reads, the device MUST function same regardless of the
device mode is \field{Active} or \field{Stop} or \field{Freeze}.
+A device MUST either support all of, or none of
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START,
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP and
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ commands.
+
+If the device supports VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY
+command, the device MUST set at least one bit in the
+\field{supported_iova_page_size_bitmap} and set a non zero value in the
+\field{supported_iova_ranges}.
+
+The device MUST fail the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ and
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP commands
+if the write recording is not started by the driver.
+
+The device MUST fail VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command
+if the write recording is not started.
+
+For the SR-IOV group type, for the VF member device, VF function level
+reset (FLR) MUST NOT stop write recording on the VF device and it MUST NOT
+clear any write records already gathered by the owner device.
+
+The device MUST clear the write records which are returned in the
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ result. After command completion
+of VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ, if a new write record is created
+for the same IOVA range, the device MUST report such a write record as
+a new entry.
+
\drivernormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
The driver SHOULD read the complete device context using one or multiple
@@ -536,3 +564,11 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
For the SR-IOV group type, the driver MUST NOT write into the
VIRTIO_PCI_CAP_PCI_CFG capability area when the device mode is set to
\field{Freeze}.
+
+The driver MUST NOT invoke VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
+for overlapping IOVA ranges; the IOVA ranges supplied in one command or
+across multiple commands MUST be unique.
+
+If the write recording is started by the driver using
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START commands, the driver MUST explicitly
+stop the write recording using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP command.
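For illustration only (not part of the proposed spec text), a driver could verify the non-overlap requirement before issuing VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START with a check along these lines. The struct wr_start_entry type is a CPU-endian mirror of struct virtio_admin_cmd_write_record_start_entry, and the helper names are hypothetical. The sketch treats each entry as the half-open byte range starting at iova with a length of page_count times page_size, and rejects entries whose length would overflow 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CPU-endian mirror of struct virtio_admin_cmd_write_record_start_entry. */
struct wr_start_entry {
    uint64_t iova;
    uint64_t page_count;
};

/* True when the half-open ranges [a, a + alen) and [b, b + blen) share at
 * least one byte. Written with differences to avoid overflow in a + alen. */
static bool ranges_overlap(uint64_t a, uint64_t alen,
                           uint64_t b, uint64_t blen)
{
    return a <= b ? (b - a) < alen : (a - b) < blen;
}

/* True when no two entries overlap and every entry length fits in 64 bits.
 * Quadratic in 'count', which is acceptable for short range lists. */
static bool entries_are_unique(const struct wr_start_entry *e, size_t count,
                               uint64_t page_size)
{
    assert(e != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (e[i].page_count != 0 &&
            page_size > UINT64_MAX / e[i].page_count)
            return false;       /* length would overflow */
    }
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (ranges_overlap(e[i].iova, e[i].page_count * page_size,
                               e[j].iova, e[j].page_count * page_size))
                return false;
        }
    }
    return true;
}
```

Because the requirement also covers ranges supplied across multiple commands, a driver using this approach would need to run the check against all ranges that are currently being recorded, not only those in the command being built.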
--
2.34.1