All of lore.kernel.org
 help / color / mirror / Atom feed
* [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics
@ 2023-05-04  8:18 zhenwei pi
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview zhenwei pi
                   ` (11 more replies)
  0 siblings, 12 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:18 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

v1 -> v2:
- Suggested by Parav, split a large patch into several small patches.
- Small changes for VQN, add "There is no strict style limitation".
- Move *bytes* field limitation from get/set config opcode section to
  Config Command.

v1:
Introduce Virtio-oF specification, include:
- overview
- Virtio Qualified Name
- Segment Descriptor definition
- Buffer Mapping definition: Stream Transmission and Keyed Transmission
- Command set definition
- opcode definition
- status definition
- transport binding: TCP and RDMA
- device initialization

Previous discussion:
https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html

zhenwei pi (11):
  transport-fabrics: introduce Virtio Over Fabrics overview
  transport-fabrics: introduce Virtio Qualified Name
  transport-fabircs: introduce Segment Descriptor Definition
  transport-fabrics: introduce Stream Transmission
  transport-fabrics: introduce Keyed Transmission
  transport-fabrics: introduce command set
  transport-fabrics: introduce opcodes
  transport-fabrics: introduce status of completion
  transport-fabrics: add TCP&RDMA binding
  transport-fabrics: add device initialization
  transport-fabrics: support inline data for keyed transmission

 content.tex           |    1 +
 transport-fabrics.tex | 1021 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1022 insertions(+)
 create mode 100644 transport-fabrics.tex

-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-04  8:57   ` David Hildenbrand
                     ` (3 more replies)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name zhenwei pi
                   ` (10 subsequent siblings)
  11 siblings, 4 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

In the past years, virtio supports lots of device specifications by
PCI/MMIO/CCW. These devices work fine in the virtualization environment.

Introduce Virtio Over Fabrics transport to support "network defined
peripheral devices". With this transport, Many Virtio based devices
transparently work over fabrics. Note that the balloon device may not
make sense. Shared memory regions won't work.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 content.tex           |  1 +
 transport-fabrics.tex | 31 +++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)
 create mode 100644 transport-fabrics.tex

diff --git a/content.tex b/content.tex
index cff548a..f899c3a 100644
--- a/content.tex
+++ b/content.tex
@@ -582,6 +582,7 @@ \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
 \input{transport-pci.tex}
 \input{transport-mmio.tex}
 \input{transport-ccw.tex}
+\input{transport-fabrics.tex}
 
 \chapter{Device Types}\label{sec:Device Types}
 
diff --git a/transport-fabrics.tex b/transport-fabrics.tex
new file mode 100644
index 0000000..0dc031b
--- /dev/null
+++ b/transport-fabrics.tex
@@ -0,0 +1,31 @@
+\section{Virtio Over Fabrics}\label{sec:Virtio Transport Options / Virtio Over Fabrics}
+
+This section defines specification to Virtio that enables operation over other
+interconnects. A central goal of Virtio Over Fabrics is to maintain consistency
+with the PCI device, so Virtio based devices transparently work over PCI or
+fabrics.
+
+Virtio Over Fabrics uses reliable connection to transmit data, the reliable
+connection betweens two rules:
+
+\begin{itemize}
+\item An initiator functions as an Virtio Over Fabrics client. An initiator
+typically serves the same purpose to a machine as a Virtio device, issues
+commands to remote side.
+\item A target functions as an Virtio Over Fabrics server. An target typically
+handles commands from the initiator side and responses completions.
+\end{itemize}
+
+Virtio Over Fabrics has the following differences from the PCI based
+specification:
+
+\begin{itemize}
+\item Instead of memory sharing mechanism of virtqueue, there is a one-to-one
+mapping between virtqueue and the reliable connection which executes the vring
+data transmission.
+\item An additional control connection is required to execute control commands
+which is similar to read/write register on a PCI device.
+\item Virtio Over Fabrics does not define an interrupt mechanism that allows an
+initiator to generate a host interrupt. It is the responsibility of the host
+fabric interface to generate host interrupts.
+\end{itemize}
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-31 14:06   ` Stefan Hajnoczi
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition zhenwei pi
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Add VQN section. The VQN is a little different from iSCSI/NVMe-oF on
style limitation. Because iSCSI/NVMe-of is storage specific protocol,
the full string IQN(for iSCSI/iSER) and NQN(for NVMe-oF) represents
a "storage access address". However, Virtio Over Fabrics works as
transport layer rather than device layer, a URL style string is better
to Virtio Over Fabrics. For example:
virtio-of://blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
virtio-of://blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
...
virtio-of://crypto-resource/25307f22-e5a8-4ea2-b7ca-79f5c3bebc3c

A hunam readable VQN is helpful to maintain/debug/distinguish.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index 0dc031b..26b0192 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -29,3 +29,19 @@ \section{Virtio Over Fabrics}\label{sec:Virtio Transport Options / Virtio Over F
 initiator to generate a host interrupt. It is the responsibility of the host
 fabric interface to generate host interrupts.
 \end{itemize}
+
+\subsection{Virtio Qualified Name}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Virtio Qualified Name}
+Virtio Qualified Names (VQNs) are used to uniquely describe an initiator or a
+target for the purposes of identification.
+
+A VQN is encoded as a string of Unicode characters with the following
+properties:
+
+\begin{itemize}
+\item The encoding is UTF-8 (refer to RFC 3629).
+\item The characters dash('-'), dot ('.'), slash('/') and colon(':') are used
+in formatting.
+\item The maximum name is 256 bytes in length.
+\item The string is null terminated.
+\item There is no strict style limitation.
+\end{itemize}
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview zhenwei pi
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-31 14:23   ` Stefan Hajnoczi
  2023-06-05  2:40   ` [virtio-comment] " Parav Pandit
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission zhenwei pi
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Introduce segment descriptor to describe the Virtio device buffer
segments.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index 26b0192..b88acfd 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -45,3 +45,46 @@ \subsection{Virtio Qualified Name}\label{sec:Virtio Transport Options / Virtio O
 \item The string is null terminated.
 \item There is no strict style limitation.
 \end{itemize}
+
+\subsection{Transmission Protocol}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol}
+This section defines transmission protocol for Virtio Over Fabrics. All the
+fields use little endian format.
+
+\subsubsection{Segment Descriptor Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Segment Descriptor Definition}
+Virtio Over Fabrics uses the following structure to describe data segment:
+
+\begin{lstlisting}
+struct virtio_of_vring_desc {
+        le64 addr;
+        le32 length;
+        /* This marks the unique ID within a command, no limitation among inflight commands */
+        le16 id;
+        /* This marks a buffer as keyed transmission (otherwise stream transmission) */
+#define VIRTIO_OF_DESC_F_KEYED     1
+        /* This marks a buffer as device write-only (otherwise device read-only). */
+#define VIRTIO_OF_DESC_F_WRITE     2
+        le16 flags;
+        le32 key;
+};
+\end{lstlisting}
+
+The structure virtio_of_vring_desc is used for both keyed transmission
+(i.e. RDMA) and stream transmission(i.e. TCP). The fields is described as follows:
+
+\begin{tabular}{ |l|l|l| }
+\hline
+Field & keyed transmission & stream transmission \\
+\hline \hline
+addr & Start address of remote memory buffer & Start address within the stream buffer \\
+\hline
+length & The length of remote memory buffer & The length of buffer within the stream \\
+\hline
+id & The ID of this descriptor & The ID of this descriptor \\
+\hline
+flags & both keyed transmission and stream transmission supported & stream transmission only \\
+\hline
+key & Key of the remote Memory Region & Ignore \\
+\hline
+\end{tabular}
+
+Depending on the opcode, a Command contains zero or more structure virtio_of_vring_desc.
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (2 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-31 15:20   ` Stefan Hajnoczi
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission zhenwei pi
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Stream transmission is used for stream oriented communication(Ex TCP),
also add virtio-blk read/write 8K example.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 229 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 229 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index b88acfd..c02cf26 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -88,3 +88,232 @@ \subsubsection{Segment Descriptor Definition}\label{sec:Virtio Transport Options
 \end{tabular}
 
 Depending on the opcode, a Command contains zero or more structure virtio_of_vring_desc.
+
+\subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Buffer Mapping Definition}
+Virtio Over Fabrics defines two types of buffer mapping rules.
+
+\paragraph{Stream Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
+Command, Segment Descriptors, and buffer are transmitted in a stream within a
+connection. The layout in stream:
+
+\begin{lstlisting}
+CMDx contains 0 descriptor, CMDy contains (n - m + 1) descriptors and buffer:
+
+     +-----+     +-----++-----+     +-----++-----+
+ ... | CMDx| ... | CMDy||DESCm| ... |DESCn|| BUF | ...
+     +-----+     +-----++-----+     +-----++-----+
+
+COMPx contains 0 descriptor, COMPy contains (k - j + 1) descriptors and buffer:
+
+     +-----+     +-----++-----+     +-----++-----+
+ ... |COMPx| ... |COMPy||DESCj| ... |DESCk|| BUF | ...
+     +-----+     +-----++-----+     +-----++-----+
+\end{lstlisting}
+
+An example of a virtio-blk write 8K request(total size: sizeof(Command) +
+4 * sizeof(Descriptor) + 8208):
+\begin{lstlisting}
+ COMMAND            +------+
+                    |opcode|  ->  virtio_of_op_vring
+                    +------+
+                    |cmd id|  ->  10
+                    +------+
+                    |length|  ->  8208
+                    +------+
+                    |ndesc |  ->  4
+                    +------+
+                    |rsvd  |
+                    +------+
+
+ DESC0              +------+
+              +-----|addr  |  -> 0
+              |     +------+
+              |     |length|  -> 16 (virtio blk write command)
+              |     +------+
+              |     |id    |  -> 0
+              |     +------+
+              |     |flags |  -> 0
+              |     +------+
+              |
+ DESC1        |     +------+
+              | +---|addr  |  -> 16
+              | |   +------+
+              | |   |length|  -> 4096
+              | |   +------+
+              | |   |id    |  -> 1
+              | |   +------+
+              | |   |flags |  -> 0
+              | |   +------+
+              | |
+ DESC2        | |   +------+
+              | | +-|addr  |  -> 4112
+              | | | +------+
+              | | | |length|  -> 4096
+              | | | +------+
+              | | | |id    |  -> 2
+              | | | +------+
+              | | | |flags |  -> 0
+              | | | +------+
+              | | |
+ DESC3        | | | +------+
+              | | | |addr  |  -> 0
+              | | | +------+
+              | | | |length|  -> 1
+              | | | +------+
+              | | | |id    |  -> 3
+              | | | +------+
+              | | | |flags |  -> VIRTIO_OF_DESC_F_WRITE
+              | | | +------+
+              | | |
+ DATA         +-+-+>+------+  -> 0
+                | | |......|
+                +-+>+------+  -> 16
+                  | |......|
+                  +>+------+  -> 4112
+                    |......|
+                    +------+  -> 8208
+\end{lstlisting}
+
+The Completion of this request(total size: sizeof(Completion) +
+1 * sizeof(Descriptor) + 1):
+\begin{lstlisting}
+ COMPLETION         +------+
+                    |status|  ->  VIRTIO_OF_SUCCESS
+                    +------+
+                    |cmd id|  ->  10
+                    +------+
+                    |ndesc |  ->  1
+                    +------+
+                    |rsvd  |
+                    +------+
+                    |value |  -> 1 (value.u32)
+                    +------+
+
+ DESC0              +------+
+                  +-|addr  |  -> 0
+                  | +------+
+                  | |length|  -> 1
+                  | +------+
+                  | |id    |  -> 3
+                  | +------+
+                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
+                  | +------+
+                  |
+ DATA             |>+------+  -> 0
+                    |......|
+                    +------+  -> 1
+\end{lstlisting}
+
+Another example of a virtio-blk read 8K request(total size: sizeof(Command) +
+4 * sizeof(Descriptor) + 16):
+\begin{lstlisting}
+ COMMAND            +------+
+                    |opcode|  ->  virtio_of_op_vring
+                    +------+
+                    |cmd id|  ->  14
+                    +------+
+                    |length|  ->  16 (virtio blk read command)
+                    +------+
+                    |ndesc |  ->  4
+                    +------+
+                    |rsvd  |
+                    +------+
+
+ DESC0              +------+
+                  +-|addr  |  -> 0
+                  | +------+
+                  | |length|  -> 16
+                  | +------+
+                  | |id    |  -> 0
+                  | +------+
+                  | |flags |  -> 0
+                  | +------+
+                  |
+ DESC1            | +------+
+                  | |addr  |  -> 0
+                  | +------+
+                  | |length|  -> 4096
+                  | +------+
+                  | |id    |  -> 1
+                  | +------+
+                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
+                  | +------+
+                  |
+ DESC2            | +------+
+                  | |addr  |  -> 0
+                  | +------+
+                  | |length|  -> 4096
+                  | +------+
+                  | |id    |  -> 2
+                  | +------+
+                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
+                  | +------+
+                  |
+ DESC3            | +------+
+                  | |addr  |  -> 0
+                  | +------+
+                  | |length|  -> 1
+                  | +------+
+                  | |id    |  -> 3
+                  | +------+
+                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
+                  | +------+
+                  |
+ DATA             +>+------+  -> 0
+                    |......|
+                    +------+  -> 16
+\end{lstlisting}
+
+The Completion of this request(total size: sizeof(Completion) +
+3 * sizeof(Descriptor) + 8193):
+\begin{lstlisting}
+ COMPLETION         +------+
+                    |status|  ->  VIRTIO_OF_SUCCESS
+                    +------+
+                    |cmd id|  ->  14
+                    +------+
+                    |ndesc |  ->  3
+                    +------+
+                    |rsvd  |
+                    +------+
+                    |value |  -> 8193 (value.u32)
+                    +------+
+
+ DESC0              +------+
+              +-----|addr  |  -> 0
+              |     +------+
+              |     |length|  -> 4096
+              |     +------+
+              |     |id    |  -> 1
+              |     +------+
+              |     |flags |  -> VIRTIO_OF_DESC_F_WRITE
+              |     +------+
+              |
+ DESC1        |     +------+
+              | +---|addr  |  -> 4096
+              | |   +------+
+              | |   |length|  -> 4096
+              | |   +------+
+              | |   |id    |  -> 2
+              | |   +------+
+              | |   |flags |  -> VIRTIO_OF_DESC_F_WRITE
+              | |   +------+
+              | |
+ DESC2        | |   +------+
+              | | +-|addr  |  -> 8192
+              | | | +------+
+              | | | |length|  -> 1
+              | | | +------+
+              | | | |id    |  -> 3
+              | | | +------+
+              | | | |flags |  -> VIRTIO_OF_DESC_F_WRITE
+              | | | +------+
+              | | |
+ DATA         +-+-+>+------+  -> 0
+                | | |......|
+                +-+>+------+  -> 4096
+                  | |......|
+                  +>+------+  -> 8192
+                    |......|
+                    +------+  -> 8193
+\end{lstlisting}
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (3 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-31 16:20   ` [virtio-comment] " Stefan Hajnoczi
  2023-06-05  2:41   ` Parav Pandit
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 06/11] transport-fabrics: introduce command set zhenwei pi
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Keyed transmission is used for message oriented communication(Ex RDMA),
also add virtio-blk read/write 8K example.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 178 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 178 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index c02cf26..7711321 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -317,3 +317,181 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
                     |......|
                     +------+  -> 8193
 \end{lstlisting}
+
+\paragraph{Keyed Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
+Command and Segment Descriptors are transmitted in a message within a
+connection, and buffer is transmitted by remote memory access.  The layout in message:
+
+\begin{lstlisting}
+CMDx contains 0 descriptor, CMDy contains (n - m + 1) descriptors:
+
+     +-----+     +-----++-----+     +-----+
+ ... | CMDx| ... | CMDy||DESCm| ... |DESCn| ...
+     +-----+     +-----++-----+     +-----+
+
+COMPx contains 0 descriptor, COMPy contains 0 descriptor:
+
+     +-----+     +-----+
+ ... |COMPx| ... |COMPy| ...
+     +-----+     +-----+
+\end{lstlisting}
+
+An example of a virtio-blk write 8K request(message size: sizeof(Command) +
+4 * sizeof(Descriptor)):
+\begin{lstlisting}
+ COMMAND            +------+
+                    |opcode|  ->  virtio_of_op_vring
+                    +------+
+                    |cmd id|  ->  10
+                    +------+
+                    |length|  ->  0
+                    +------+
+                    |ndesc |  ->  4
+                    +------+
+                    |rsvd  |
+                    +------+
+
+ DESC0              +------+
+                    |addr  |  -> 0xffff012345670000
+                    +------+
+                    |length|  -> 16 (virtio blk write command)
+                    +------+
+                    |id    |  -> 0
+                    +------+
+                    |flags |  -> VIRTIO_OF_DESC_F_KEYED
+                    +------+
+                    |key   |  -> 0x1234
+                    +------+
+
+ DESC1              +------+
+                    |addr  |  -> 0xffff012345671000
+                    +------+
+                    |length|  -> 4096
+                    +------+
+                    |id    |  -> 1
+                    +------+
+                    |flags |  -> VIRTIO_OF_DESC_F_KEYED
+                    +------+
+                    |key   |  -> 0x1236
+                    +------+
+
+ DESC2              +------+
+                    |addr  |  -> 0xffff012345673000
+                    +------+
+                    |length|  -> 4096
+                    +------+
+                    |id    |  -> 2
+                    +------+
+                    |flags |  -> VIRTIO_OF_DESC_F_KEYED
+                    +------+
+                    |key   |  -> 0x1238
+                    +------+
+
+ DESC3              +------+
+                    |addr  |  -> 0xffff012345677000
+                    +------+
+                    |length|  -> 1
+                    +------+
+                    |id    |  -> 3
+                    +------+
+                    |flags |  -> VIRTIO_OF_DESC_F_KEYED | VIRTIO_OF_DESC_F_WRITE
+                    +------+
+                    |key   |  -> 0x1239
+                    +------+
+\end{lstlisting}
+
+The target handles Command, reads the remote addresses of DESC0/DESC1/DESC2,
+writes the remote address of DESC3, then responses Completion:
+\begin{lstlisting}
+ COMPLETION         +------+
+                    |status|  ->  VIRTIO_OF_SUCCESS
+                    +------+
+                    |cmd id|  ->  10
+                    +------+
+                    |ndesc |  ->  0
+                    +------+
+                    |rsvd  |
+                    +------+
+                    |value |  -> 1 (value.u32)
+                    +------+
+\end{lstlisting}
+
+Another example of a virtio-blk read 8K request(message size: sizeof(Command) +
+4 * sizeof(Descriptor)):
+\begin{lstlisting}
+ COMMAND            +------+
+                    |opcode|  ->  virtio_of_op_vring
+                    +------+
+                    |cmd id|  ->  10
+                    +------+
+                    |length|  ->  0
+                    +------+
+                    |ndesc |  ->  4
+                    +------+
+                    |rsvd  |
+                    +------+
+
+ DESC0              +------+
+                    |addr  |  -> 0xffff012345670000
+                    +------+
+                    |length|  -> 16 (virtio blk write command)
+                    +------+
+                    |id    |  -> 0
+                    +------+
+                    |flags |  -> VIRTIO_OF_DESC_F_KEYED
+                    +------+
+                    |key   |  -> 0x1234
+                    +------+
+
+ DESC1              +------+
+                    |addr  |  -> 0xffff012345671000
+                    +------+
+                    |length|  -> 4096
+                    +------+
+                    |id    |  -> 1
+                    +------+
+                    |flags |  -> VIRTIO_OF_DESC_F_KEYED | VIRTIO_OF_DESC_F_WRITE
+                    +------+
+                    |key   |  -> 0x1236
+                    +------+
+
+ DESC2              +------+
+                    |addr  |  -> 0xffff012345673000
+                    +------+
+                    |length|  -> 4096
+                    +------+
+                    |id    |  -> 2
+                    +------+
+                    |flags |  -> VIRTIO_OF_DESC_F_KEYED | VIRTIO_OF_DESC_F_WRITE
+                    +------+
+                    |key   |  -> 0x1238
+                    +------+
+
+ DESC3              +------+
+                    |addr  |  -> 0xffff012345677000
+                    +------+
+                    |length|  -> 1
+                    +------+
+                    |id    |  -> 3
+                    +------+
+                    |flags |  -> VIRTIO_OF_DESC_F_KEYED | VIRTIO_OF_DESC_F_WRITE
+                    +------+
+                    |key   |  -> 0x1239
+                    +------+
+\end{lstlisting}
+
+The target handles Command, reads the remote address of DESC0, writes the remote
+addresses of DESC1/DESC2/DESC3, then responses Completion:
+\begin{lstlisting}
+ COMPLETION         +------+
+                    |status|  ->  VIRTIO_OF_SUCCESS
+                    +------+
+                    |cmd id|  ->  10
+                    +------+
+                    |ndesc |  ->  0
+                    +------+
+                    |rsvd  |
+                    +------+
+                    |value |  -> 8193 (value.u32)
+                    +------+
+\end{lstlisting}
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (4 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-31 17:10   ` [virtio-comment] " Stefan Hajnoczi
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 07/11] transport-fabrics: introduce opcodes zhenwei pi
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Introduce command structures for Virtio-oF.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 209 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 209 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index 7711321..37f57c6 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -495,3 +495,212 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
                     |value |  -> 8193 (value.u32)
                     +------+
 \end{lstlisting}
+
+\subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
+This section defines command structures for Virtio Over Fabrics.
+
+A common structure virtio_of_value is fixed to 8 bytes and MUST be used as one
+of the following format:
+
+\begin{itemize}
+\item u8
+\item le16
+\item le32
+\item le64
+\end{itemize}
+
+\paragraph{Command ID}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Command ID}
+There is command_id(le16) field in each Command and Completion:
+
+\begin{itemize}
+\item Generally the initiator allocates a Command ID and specifies the
+command_id field of a Command, and the target MUST specify the same Command ID
+in command_id field of Completion.
+\item The initiator MUST guarantee each Command ID is unique in the inflight Commands.
+\item Command ID 0xff00 - 0xffff is reserved for control queue to delivery asynchronous event.
+\end{itemize}
+
+The reserved Command ID for control queue is defined as follows:
+
+\begin{tabular}{ |l|l| }
+\hline
+Command ID & Description \\
+\hline \hline
+0xffff & Keepalive. The initiator SHOULD ignore this event \\
+\hline
+0xfffe & Config change. The initiator SHOULD generate config change interrupt to device \\
+\hline
+0xff00 - 0xfffd & Reserved \\
+\hline
+\end{tabular}
+
+\paragraph{Connect Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
+The Connect Command is used to establish Virtio Over Fabrics queue. The control
+queue MUST be established firstly, then the Connect command establishes an
+association between the initiator and the target.
+
+The Target ID of 0xffff is reserved, then:
+\begin{itemize}
+\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
+Command for the control queue.
+\item The target SHOULD allocate any available Target ID to the initiator,
+and return the allocated Target ID in the Completion.
+\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
+MUST be specified in a Connect Command for the virtqueue.
+\end{itemize}
+
+The Connect Command has following structure:
+
+\begin{lstlisting}
+struct virtio_of_command_connect {
+        le16 opcode;
+        le16 command_id;
+        le16 target_id;
+        le16 queue_id;
+        le16 ndesc;
+#define VIRTIO_OF_CONNECTION_TCP     1
+#define VIRTIO_OF_CONNECTION_RDMA    2
+        u8 oftype;
+        u8 padding[5];
+};
+\end{lstlisting}
+
+The Connect commands MUST contains one Segment Descriptor and one structure
+virtio_of_command_connect to specify Initiator VQN and Target VNQ,
+virtio_of_command_connect has following structure:
+
+\begin{lstlisting}
+struct virtio_of_connect {
+        u8 ivqn[256];
+        u8 tvqn[256];
+        u8 padding[512];
+};
+\end{lstlisting}
+
+\paragraph{Feature Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command}
+
+The control queue uses Feature Command to get or set features. This command is used for:
+
+\begin{itemize}
+\item The initiator/target features. This is used to negotiate transport layer features.
+\item The driver/device features. This is used to negotiate Virtio Based device
+features which is similar to PCI based device.
+\end{itemize}
+
+The Feature Command has following structure:
+
+\begin{lstlisting}
+struct virtio_of_command_feature {
+        le16 opcode;
+        le16 command_id;
+        le32 feature_select;
+        le64 value;        /* ignore this field on GET */
+};
+\end{lstlisting}
+
+\paragraph{Queue Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Queue Command}
+
+The control queue uses Queue Command to get or set properties on a specific queue.
+The Queue Command has following structure:
+
+\begin{lstlisting}
+struct virtio_of_command_queue {
+        le16 opcode;
+        le16 command_id;
+        le16 queue_id;
+        u8 padding6;
+        u8 padding7;
+        struct virtio_of_value value;   /* ignore this field on GET */
+};
+\end{lstlisting}
+
+
+\paragraph{Config Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Config Command}
+
+The control queue uses Config Command to get or set configure on device.
+The Config Command has following structure:
+
+\begin{lstlisting}
+struct virtio_of_command_config {
+        le16 opcode;
+        le16 command_id;
+        le16 offset;
+        u8 bytes;
+        u8 padding7;
+        struct virtio_of_value value;        /* ignore this field on GET */
+};
+\end{lstlisting}
+
+The bytes field supports on Get only:
+
+\begin{itemize}
+\item 1, then the initiator reads from value field of Completion as u8
+\item 2, then the initiator reads from value field of Completion as le16
+\item 4, then the initiator reads from value field of Completion as le32
+\item 8, then the initiator reads from value field of Completion as le64
+\end{itemize}
+
+The bytes field supports on Set only:
+
+\begin{itemize}
+\item 1, then the initiator specifies the value field of Config Command as u8
+\item 2, then the initiator specifies the value field of Config Command as le16
+\item 4, then the initiator specifies the value field of Config Command as le32
+\item 8, then the initiator specifies the value field of Config Command as le64
+\end{itemize}
+
+\paragraph{Common Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command}
+
+The control queue uses Common Command to get or set common properties on
+device(i.e. get device ID). The Common Command has following structure:
+
+\begin{lstlisting}
+struct virtio_of_command_common {
+        le16 opcode;
+        le16 command_id;
+        u8 padding4;
+        u8 padding5;
+        u8 padding6;
+        u8 padding7;
+        struct virtio_of_value value;        /* ignore this field on GET */
+};
+\end{lstlisting}
+
+
+\paragraph{Vring Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Vring Command}
+
+Both control queue and virtqueue use Vring Command to transmit buffer.
+The Vring Command has following structure:
+
+\begin{lstlisting}
+struct virtio_of_command_vring {
+        le16 opcode;
+        le16 command_id;
+        /* Total buffer size this command contains(not include command&descriptors). */
+        le32 length;
+        /* How many descriptors this command contains */
+        le16 ndesc;
+        u8 padding[6];
+};
+\end{lstlisting}
+
+\paragraph{Completion}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Completion}
+
+The target responses Completion to the initiator to report command status,
+device properties, and transmit buffer. The Completion has following structure:
+
+\begin{lstlisting}
+struct virtio_of_completion {
+        le16 status;
+        le16 command_id;
+        /* How many descriptors this completion contains */
+        le16 ndesc;
+        u8 rsvd6;
+        u8 rsvd7;
+        struct virtio_of_value value;
+};
+\end{lstlisting}
+
+Note that Virtio Over Fabrics does not define an interrupt mechanism, generally
+the initiator receives a Completion, it SHOULD generate a host interrupt
+(if no interrupt suspending on device).
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 07/11] transport-fabrics: introduce opcodes
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (5 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 06/11] transport-fabrics: introduce command set zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-31 17:11   ` [virtio-comment] " Stefan Hajnoczi
       [not found]   ` <20230531205508.GA1509630@fedora>
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 08/11] transport-fabrics: introduce status of completion zhenwei pi
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Define opcode with this rule:
The Virtio-oF transport layer commands use 0x0000-0x0fff, and the
device layer commands use 0x1000-0xffff. get/set status/feature/
config use consecutive number.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 134 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 134 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index 37f57c6..026ff5f 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -704,3 +704,137 @@ \subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio
 Note that Virtio Over Fabrics does not define an interrupt mechanism, generally
 the initiator receives a Completion, it SHOULD generate a host interrupt
 (if no interrupt suspending on device).
+
+\subsubsection{Opcodes Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition}
+This section defines command opcodes for Virtio Over Fabrics:
+
+\begin{lstlisting}
+#define virtio_of_op_connect               0x0000
+#define virtio_of_op_discconnect           0x0001
+#define virtio_of_op_get_feature           0x0002
+#define virtio_of_op_set_feature           0x0003
+#define virtio_of_op_keepalive             0x0004
+#define virtio_of_op_vring                 0x0fff
+#define virtio_of_op_get_vendor_id         0x1000
+#define virtio_of_op_get_device_id         0x1001
+#define virtio_of_op_get_generation        0x1002
+#define virtio_of_op_get_status            0x1004
+#define virtio_of_op_set_status            0x1005
+#define virtio_of_op_get_device_feature    0x1006
+#define virtio_of_op_set_driver_feature    0x1007
+#define virtio_of_op_get_num_queues        0x1008
+#define virtio_of_op_get_queue_size        0x100a
+#define virtio_of_op_get_config            0x100c
+#define virtio_of_op_set_config            0x100d
+\end{lstlisting}
+
+\paragraph{virtio_of_op_connect}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_connect}
+
+virtio_of_op_connect is used to connect a target for both control queue and virtqueue.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
+and specify the ndesc field as 1, also contains 1 structure virtio_of_vring_desc
+filled by structure virtio_of_command_status.
+
+\paragraph{virtio_of_op_discconnect}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_discconnect}
+
+virtio_of_op_discconnect is used to disconnect a target for both control queue and virtqueue.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command}.
+
+\paragraph{virtio_of_op_get_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_feature}
+
+virtio_of_op_get_feature is used to get features of target for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command}.
+
+\begin{tabular}{ |l|l|l| }
+\hline
+Feature Select & Value & Description \\
+\hline
+virtio_of_feature_max_segment & 0x0 & Get the maximum segments within a Vring Command supported by target \\
+\hline
+\end{tabular}
+
+\paragraph{virtio_of_op_set_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_feature}
+
+virtio_of_op_set_feature is used to set features of initiator for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command}.
+
+\paragraph{virtio_of_op_keepalive}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_keepalive}
+
+virtio_of_op_keepalive is used to keep alive with the target for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command}.
+
+\paragraph{virtio_of_op_vring}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_vring}
+
+virtio_of_op_vring is used to transmit buffer for both control queue and virtqueue.
+The initiator MUST issues \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Vring Command}
+and specify the ndesc field as the number of buffer segments,
+also contains ndesc structure virtio_of_vring_desc.
+Each structure virtio_of_vring_desc is filled by each buffer segment one by one.
+
+\paragraph{virtio_of_op_get_vendor_id}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_vendor_id}
+
+virtio_of_op_get_vendor_id is used to get vendor id for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
+and reads from value field of Completion as le32.
+
+\paragraph{virtio_of_op_get_device_id}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_device_id}
+
+virtio_of_op_get_device_id is used to get device id for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
+and reads from value field of Completion as le32.
+
+\paragraph{virtio_of_op_get_generation}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_generation}
+
+virtio_of_op_get_generation is used to get config generation for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
+and reads from value field of Completion as le32.
+
+\paragraph{virtio_of_op_get_status}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_status}
+
+virtio_of_op_get_status is used to get device status for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
+and reads from value field of Completion as le32.
+
+\paragraph{virtio_of_op_set_status}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_status}
+
+virtio_of_op_set_status is used to set device status for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
+and specify the value field of Common Command as le32.
+
+\paragraph{virtio_of_op_get_device_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_device_feature}
+
+virtio_of_op_get_device_feature is used to get device feature for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command},
+and reads from value field of Completion as le64.
+
+\paragraph{virtio_of_op_set_driver_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_driver_feature}
+
+virtio_of_op_set_driver_feature is used to set driver feature for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command},
+and specify the value field of Common Command as le64.
+
+The initiator uses feature_select field to select which feature bits to set.
+Value 0x0 selects Feature Bits 0 to 63, 0x1 selects Feature Bits 64 to 128, etc.
+
+\paragraph{virtio_of_op_get_num_queues}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_num_queues}
+
+virtio_of_op_get_num_queues is used to get the number of queues for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
+and reads from value field of Completion as le16.
+
+\paragraph{virtio_of_op_get_queue_size}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_queue_size}
+
+virtio_of_op_get_queue_size is used to get the size of a specified queue for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Queue Command} with specified queue_id,
+and reads from value field of Completion as le16.
+
+\paragraph{virtio_of_op_get_config}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_config}
+
+virtio_of_op_get_config is used to get the config of a device for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Config Command} with specified offset and bytes,
+and reads from value field of Completion.
+
+\paragraph{virtio_of_op_set_config}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_config}
+
+virtio_of_op_set_config is used to set the config of a device for control queue only.
+The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Config Command} with specified offset and bytes and value fields.
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 08/11] transport-fabrics: introduce status of completion
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (6 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 07/11] transport-fabrics: introduce opcodes zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding zhenwei pi
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Define status of completion, currently status 0-114 has the same
meaning to the linux error code, because these status are used for
many years, they are clear and friendly to linux system developers
(even for other platform developers).

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index 026ff5f..f563c3e 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -838,3 +838,38 @@ \subsubsection{Opcodes Definition}\label{sec:Virtio Transport Options / Virtio O
 
 virtio_of_op_set_config is used to set the config of a device for control queue only.
 The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Config Command} with specified offset and bytes and value fields.
+
+\subsubsection{Status Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Status Definition}
+This section defines status for Virtio Over Fabrics Completion.
+
+\begin{lstlisting}
+#define VIRTIO_OF_SUCCESS       0
+#define VIRTIO_OF_EPERM         1
+#define VIRTIO_OF_ENOENT        2
+#define VIRTIO_OF_EIO           5
+#define VIRTIO_OF_ENXIO         6
+#define VIRTIO_OF_E2BIG         7
+#define VIRTIO_OF_ENOMEM        12
+#define VIRTIO_OF_EACCES        13
+#define VIRTIO_OF_EFAULT        14
+#define VIRTIO_OF_EBUSY         16
+#define VIRTIO_OF_EEXIST        17
+#define VIRTIO_OF_ENODEV        19
+#define VIRTIO_OF_EINVAL        22
+#define VIRTIO_OF_ERANGE        34
+#define VIRTIO_OF_ENOSYS        38
+#define VIRTIO_OF_ECHRNG        44
+#define VIRTIO_OF_EUNATCH       49
+#define VIRTIO_OF_EBADE         52
+#define VIRTIO_OF_EBADR         53
+#define VIRTIO_OF_EBADRQC       56
+#define VIRTIO_OF_ENODATA       61
+#define VIRTIO_OF_EPROTO        71
+#define VIRTIO_OF_EBADMSG       74
+#define VIRTIO_OF_ENOTUNIQ      76
+#define VIRTIO_OF_EREMCHG       78
+#define VIRTIO_OF_EUSERS        87
+#define VIRTIO_OF_EOPNOTSUPP    95
+#define VIRTIO_OF_EALREADY      114
+#define VIRTIO_OF_EQUIRK        4096
+\end{lstlisting}
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (7 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 08/11] transport-fabrics: introduce status of completion zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
       [not found]   ` <20230531210255.GC1509630@fedora>
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 10/11] transport-fabrics: add device initialization zhenwei pi
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index f563c3e..c47a744 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -873,3 +873,12 @@ \subsubsection{Status Definition}\label{sec:Virtio Transport Options / Virtio Ov
 #define VIRTIO_OF_EALREADY      114
 #define VIRTIO_OF_EQUIRK        4096
 \end{lstlisting}
+
+\subsection{Transport Binding}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transport Binding}
+\subsubsection{TCP}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / TCP}
+TCP MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
+~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}.
+
+\subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / RDMA}
+RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
+~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}.
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 10/11] transport-fabrics: add device initialization
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (8 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
       [not found]   ` <20230531210925.GD1509630@fedora>
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 11/11] transport-fabrics: support inline data for keyed transmission zhenwei pi
  2023-05-29  0:56 ` [virtio-comment] PING: [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
  11 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index c47a744..af35622 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -882,3 +882,27 @@ \subsubsection{TCP}\label{sec:Virtio Transport Options / Virtio Over Fabrics / r
 \subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / RDMA}
 RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
 ~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}.
+
+\subsection{Device Initialization}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Device Initialization}
+\begin{enumerate}
+\item The control queue MUST be established firstly, once the reliable
+connection is ready, the initiator MUST issue
+\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_connect}
+to create association with the target.
+\item The initiator SHOULD issue
+\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_feature}
+to discover the capabilities offered by the target.
+\item The initiator SHOULD issue
+\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_feature}
+to negotiate the capabilities.
+\item The initiator SHOULD continue initialization like PCI base devices, i.e. issue
+\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_vendor_id}
+to get the vendor ID.
+\item After discovering the number of virtqueues by
+\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_num_queues},
+the initiator SHOULD create virtqueue one by one by
+\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_connect}.
+\item The virtqueue SHOULD issue
+\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_vring}
+to transmit buffer.
+\end{enumerate}
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [virtio-comment] [PATCH v2 11/11] transport-fabrics: support inline data for keyed transmission
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (9 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 10/11] transport-fabrics: add device initialization zhenwei pi
@ 2023-05-04  8:19 ` zhenwei pi
  2023-05-29  0:56 ` [virtio-comment] PING: [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
  11 siblings, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  8:19 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong, zhenwei pi

Add transport feature 'virtio_of_feature_stream_size' to negotiate
the inline data size. Lots of Virtio device protocol has small size
field(typically, 'status' to indicate the result of request), to
reduce the network RTT, support inline data for keyed transmission.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
---
 transport-fabrics.tex | 113 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/transport-fabrics.tex b/transport-fabrics.tex
index af35622..1e76bc6 100644
--- a/transport-fabrics.tex
+++ b/transport-fabrics.tex
@@ -496,6 +496,117 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
                     +------+
 \end{lstlisting}
 
+For effective transmission, stream Segment Descriptors and keyed Segment
+Descriptors are allowed to use together in a single command.
+
+\begin{itemize}
+\item The initiator MAY discover the maximum stream transmission size of a
+command supported by target. See \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_feature}~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_feature}.
+\item The initiator MAY set the maximum stream transmission size of a command
+supported by initiator. See \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_feature}~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_feature}.
+\end{itemize}
+
+An example of 16 bytes maximum stream transmission size supported by target,
+and 1 byte maximum stream transmission size supported by initiator, a virtio-blk
+read 8K request(total size: sizeof(Command) + 4 * sizeof(Descriptor) + 16):
+\begin{lstlisting}
+ COMMAND            +------+
+                    |opcode|  ->  virtio_of_op_vring
+                    +------+
+                    |cmd id|  ->  100
+                    +------+
+                    |length|  ->  16 (virtio blk read command)
+                    +------+
+                    |ndesc |  ->  4
+                    +------+
+                    |rsvd  |
+                    +------+
+
+ DESC0              +------+
+                  +-|addr  |  -> 0
+                  | +------+
+                  | |length|  -> 16
+                  | +------+
+                  | |id    |  -> 0
+                  | +------+
+                  | |flags |  -> 0
+                  | +------+
+                  |
+ DESC1            | +------+
+                  | |addr  |  -> 0xffff012345670000
+                  | +------+
+                  | |length|  -> 4096
+                  | +------+
+                  | |id    |  -> 1
+                  | +------+
+                  | |flags |  -> VIRTIO_OF_DESC_F_KEYED | VIRTIO_OF_DESC_F_WRITE
+                  | +------+
+                  | |key   |  -> 0x1238
+                  | +------+
+                  |
+ DESC2            | +------+
+                  | |addr  |  -> 0xffff012345671000
+                  | +------+
+                  | |length|  -> 4096
+                  | +------+
+                  | |id    |  -> 2
+                  | +------+
+                  | |flags |  -> VIRTIO_OF_DESC_F_KEYED | VIRTIO_OF_DESC_F_WRITE
+                  | +------+
+                  | |key   |  -> 0x1239
+                  | +------+
+                  |
+ DESC3            | +------+
+                  | |addr  |  -> 0xffff012345673000
+                  | +------+
+                  | |length|  -> 1
+                  | +------+
+                  | |id    |  -> 3
+                  | +------+
+                  | |flags |  -> VIRTIO_OF_DESC_F_KEYED | VIRTIO_OF_DESC_F_WRITE
+                  | +------+
+                  | |key   |  -> 0x1233
+                  | +------+
+                  |
+ DATA             +>+------+  -> 0
+                    |......|
+                    +------+  -> 16
+\end{lstlisting}
+
+The target MAY handle Command, reads 16 bytes from request described by DESC0,
+writes the remote addresses of DESC1/DESC2, then responses Completion(total
+size: sizeof(Completion) + sizeof(Descriptor) + 1):
+\begin{lstlisting}
+ COMPLETION         +------+
+                    |status|  ->  VIRTIO_OF_SUCCESS
+                    +------+
+                    |cmd id|  ->  10
+                    +------+
+                    |ndesc |  ->  1
+                    +------+
+                    |rsvd  |
+                    +------+
+                    |value |  -> 1 (value.u32)
+                    +------+
+
+ DESC0              +------+
+                  +-|addr  |  -> 0
+                  | +------+
+                  | |length|  -> 1
+                  | +------+
+                  | |id    |  -> 3
+                  | +------+
+                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
+                  | +------+
+                  |
+ DATA             |>+------+  -> 0
+                    |......|
+                    +------+  -> 1
+\end{lstlisting}
+
+Note that the target is allowed to write the remote address of DESC3 and
+response Completion only in this example.
+
 \subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
 This section defines command structures for Virtio Over Fabrics.
 
@@ -751,6 +862,8 @@ \subsubsection{Opcodes Definition}\label{sec:Virtio Transport Options / Virtio O
 \hline
 virtio_of_feature_max_segment & 0x0 & Get the maximum segments within a Vring Command supported by target \\
 \hline
+virtio_of_feature_stream_size & 0x1 & Get the target/set the initiator stream buffer size of a Command \\
+\hline
 \end{tabular}
 
 \paragraph{virtio_of_op_set_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_feature}
-- 
2.25.1


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview zhenwei pi
@ 2023-05-04  8:57   ` David Hildenbrand
  2023-05-04  9:46     ` zhenwei pi
  2023-05-31 14:00   ` [virtio-comment] " Stefan Hajnoczi
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: David Hildenbrand @ 2023-05-04  8:57 UTC (permalink / raw)
  To: zhenwei pi, parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong

On 04.05.23 10:19, zhenwei pi wrote:
> In the past years, virtio supports lots of device specifications by
> PCI/MMIO/CCW. These devices work fine in the virtualization environment.
> 
> Introduce Virtio Over Fabrics transport to support "network defined
> peripheral devices". With this transport, Many Virtio based devices
> transparently work over fabrics. Note that the balloon device may not
> make sense. Shared memory regions won't work.

Anything that involves memory (memory balloon, memory device, shared 
memory) is fully incompatible I would guess.

-- 
Thanks,

David / dhildenb


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04  8:57   ` David Hildenbrand
@ 2023-05-04  9:46     ` zhenwei pi
  2023-05-04 10:05       ` Michael S. Tsirkin
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-05-04  9:46 UTC (permalink / raw)
  To: David Hildenbrand, parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong



On 5/4/23 16:57, David Hildenbrand wrote:
> On 04.05.23 10:19, zhenwei pi wrote:
>> In the past years, virtio supports lots of device specifications by
>> PCI/MMIO/CCW. These devices work fine in the virtualization environment.
>>
>> Introduce Virtio Over Fabrics transport to support "network defined
>> peripheral devices". With this transport, Many Virtio based devices
>> transparently work over fabrics. Note that the balloon device may not
>> make sense. Shared memory regions won't work.
> 
> Anything that involves memory (memory balloon, memory device, shared 
> memory) is fully incompatible I would guess.
> 

Hi,

Agree that memory device and shared memory is incompatible.

But memory balloon device, I don't know if it's possible to use in 
future. As far as I can imagine: a user process runs virtio-of target by 
listening localhost, and memory balloon isolates same pages from kernel, 
the user process maps these physical pages into virtual memory ...

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04  9:46     ` zhenwei pi
@ 2023-05-04 10:05       ` Michael S. Tsirkin
  2023-05-04 10:12         ` David Hildenbrand
  2023-05-04 10:50         ` Re: " zhenwei pi
  0 siblings, 2 replies; 74+ messages in thread
From: Michael S. Tsirkin @ 2023-05-04 10:05 UTC (permalink / raw)
  To: zhenwei pi
  Cc: David Hildenbrand, parav, stefanha, jasowang, virtio-comment,
	houp, helei.sig11, xinhao.kong

On Thu, May 04, 2023 at 05:46:46PM +0800, zhenwei pi wrote:
> 
> 
> On 5/4/23 16:57, David Hildenbrand wrote:
> > On 04.05.23 10:19, zhenwei pi wrote:
> > > In the past years, virtio supports lots of device specifications by
> > > PCI/MMIO/CCW. These devices work fine in the virtualization environment.
> > > 
> > > Introduce Virtio Over Fabrics transport to support "network defined
> > > peripheral devices". With this transport, Many Virtio based devices
> > > transparently work over fabrics. Note that the balloon device may not
> > > make sense. Shared memory regions won't work.
> > 
> > Anything that involves memory (memory balloon, memory device, shared
> > memory) is fully incompatible I would guess.
> > 
> 
> Hi,
> 
> Agree that memory device and shared memory is incompatible.
> 
> But memory balloon device, I don't know if it's possible to use in future.
> As far as I can imagine: a user process runs virtio-of target by listening
> localhost, and memory balloon isolates same pages from kernel, the user
> process maps these physical pages into virtual memory ...

I don't see anything big that is wrong with balloon. device inflates
balloon, this memory is stolen from guest. isn't this what user
wanted? presumably there's a way for device to send this
information to the hypervisor otherwise it seems kind of
pointless.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04 10:05       ` Michael S. Tsirkin
@ 2023-05-04 10:12         ` David Hildenbrand
  2023-05-04 10:50         ` Re: " zhenwei pi
  1 sibling, 0 replies; 74+ messages in thread
From: David Hildenbrand @ 2023-05-04 10:12 UTC (permalink / raw)
  To: Michael S. Tsirkin, zhenwei pi
  Cc: parav, stefanha, jasowang, virtio-comment, houp, helei.sig11,
	xinhao.kong

On 04.05.23 12:05, Michael S. Tsirkin wrote:
> On Thu, May 04, 2023 at 05:46:46PM +0800, zhenwei pi wrote:
>>
>>
>> On 5/4/23 16:57, David Hildenbrand wrote:
>>> On 04.05.23 10:19, zhenwei pi wrote:
>>>> In the past years, virtio supports lots of device specifications by
>>>> PCI/MMIO/CCW. These devices work fine in the virtualization environment.
>>>>
>>>> Introduce Virtio Over Fabrics transport to support "network defined
>>>> peripheral devices". With this transport, Many Virtio based devices
>>>> transparently work over fabrics. Note that the balloon device may not
>>>> make sense. Shared memory regions won't work.
>>>
>>> Anything that involves memory (memory balloon, memory device, shared
>>> memory) is fully incompatible I would guess.
>>>
>>
>> Hi,
>>
>> Agree that memory device and shared memory is incompatible.
>>
>> But memory balloon device, I don't know if it's possible to use in future.
>> As far as I can imagine: a user process runs virtio-of target by listening
>> localhost, and memory balloon isolates same pages from kernel, the user
>> process maps these physical pages into virtual memory ...
> 
> I don't see anything big that is wrong with balloon. device inflates
> balloon, this memory is stolen from guest. isn't this what user
> wanted? presumably there's a way for device to send this
> information to the hypervisor otherwise it seems kind of
> pointless.

If it's really just about sending messages, then even virtio-mem would 
be possible I assume?

-- 
Thanks,

David / dhildenb


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: Re: [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04 10:05       ` Michael S. Tsirkin
  2023-05-04 10:12         ` David Hildenbrand
@ 2023-05-04 10:50         ` zhenwei pi
  1 sibling, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-04 10:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, parav, stefanha, jasowang, virtio-comment,
	houp, helei.sig11, xinhao.kong



On 5/4/23 18:05, Michael S. Tsirkin wrote:
> On Thu, May 04, 2023 at 05:46:46PM +0800, zhenwei pi wrote:
>>
>>
>> On 5/4/23 16:57, David Hildenbrand wrote:
>>> On 04.05.23 10:19, zhenwei pi wrote:
>>>> In the past years, virtio supports lots of device specifications by
>>>> PCI/MMIO/CCW. These devices work fine in the virtualization environment.
>>>>
>>>> Introduce Virtio Over Fabrics transport to support "network defined
>>>> peripheral devices". With this transport, Many Virtio based devices
>>>> transparently work over fabrics. Note that the balloon device may not
>>>> make sense. Shared memory regions won't work.
>>>
>>> Anything that involves memory (memory balloon, memory device, shared
>>> memory) is fully incompatible I would guess.
>>>
>>
>> Hi,
>>
>> Agree that memory device and shared memory is incompatible.
>>
>> But memory balloon device, I don't know if it's possible to use in future.
>> As far as I can imagine: a user process runs virtio-of target by listening
>> localhost, and memory balloon isolates same pages from kernel, the user
>> process maps these physical pages into virtual memory ...
> 
> I don't see anything big that is wrong with balloon. device inflates
> balloon, this memory is stolen from guest. isn't this what user
> wanted? presumably there's a way for device to send this
> information to the hypervisor otherwise it seems kind of
> pointless.
> 

Hi,

A user process has a chance to steal some pages from a kernel, these 
pages will not be used by kernel(no memory migration, no KSM, no kswap 
on these pages). As of the current situation, this maybe seem pointless ...

I don't disagree with David's point about memory balloon. ^_^

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] PING: [PATCH v2 00/11] Introduce Virtio Over Fabrics
  2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
                   ` (10 preceding siblings ...)
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 11/11] transport-fabrics: support inline data for keyed transmission zhenwei pi
@ 2023-05-29  0:56 ` zhenwei pi
  11 siblings, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-05-29  0:56 UTC (permalink / raw)
  To: parav, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong

PING!

On 5/4/23 16:18, zhenwei pi wrote:
> v1 -> v2:
> - Suggested by Parav, split a large patch into several small patches.
> - Small changes for VQN, add "There is no strict style limitation".
> - Move *bytes* field limitation from get/set config opcode section to
>    Config Command.
> 
> v1:
> Introduce Virtio-oF specification, include:
> - overview
> - Virtio Qualified Name
> - Segment Descriptor definition
> - Buffer Mapping definition: Stream Transmission and Keyed Transmission
> - Command set definition
> - opcode definition
> - status definition
> - transport binding: TCP and RDMA
> - device initialization
> 
> Previous discussion:
> https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html
> 
> zhenwei pi (11):
>    transport-fabrics: introduce Virtio Over Fabrics overview
>    transport-fabrics: introduce Virtio Qualified Name
>    transport-fabircs: introduce Segment Descriptor Definition
>    transport-fabrics: introduce Stream Transmission
>    transport-fabrics: introduce Keyed Transmission
>    transport-fabrics: introduce command set
>    transport-fabrics: introduce opcodes
>    transport-fabrics: introduce status of completion
>    transport-fabrics: add TCP&RDMA binding
>    transport-fabrics: add device initialization
>    transport-fabrics: support inline data for keyed transmission
> 
>   content.tex           |    1 +
>   transport-fabrics.tex | 1021 +++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 1022 insertions(+)
>   create mode 100644 transport-fabrics.tex
> 

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview zhenwei pi
  2023-05-04  8:57   ` David Hildenbrand
@ 2023-05-31 14:00   ` Stefan Hajnoczi
  2023-06-02  1:17     ` [virtio-comment] " zhenwei pi
  2023-06-05  2:39   ` [virtio-comment] " Parav Pandit
  2023-06-05  2:39   ` Parav Pandit
  3 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-05-31 14:00 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 4691 bytes --]

On Thu, May 04, 2023 at 04:19:00PM +0800, zhenwei pi wrote:
> In the past years, virtio supports lots of device specifications by
> PCI/MMIO/CCW. These devices work fine in the virtualization environment.
> 
> Introduce Virtio Over Fabrics transport to support "network defined
> peripheral devices". With this transport, Many Virtio based devices
> transparently work over fabrics. Note that the balloon device may not
> make sense. Shared memory regions won't work.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>  content.tex           |  1 +
>  transport-fabrics.tex | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
>  create mode 100644 transport-fabrics.tex
> 
> diff --git a/content.tex b/content.tex
> index cff548a..f899c3a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -582,6 +582,7 @@ \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>  \input{transport-pci.tex}
>  \input{transport-mmio.tex}
>  \input{transport-ccw.tex}
> +\input{transport-fabrics.tex}
>  
>  \chapter{Device Types}\label{sec:Device Types}
>  
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> new file mode 100644
> index 0000000..0dc031b
> --- /dev/null
> +++ b/transport-fabrics.tex
> @@ -0,0 +1,31 @@
> +\section{Virtio Over Fabrics}\label{sec:Virtio Transport Options / Virtio Over Fabrics}
> +
> +This section defines specification to Virtio that enables operation over other
> +interconnects. A central goal of Virtio Over Fabrics is to maintain consistency
> +with the PCI device, so Virtio based devices transparently work over PCI or
> +fabrics.

The reader wants to know what VIRTIO Over Fabrics is, not how it relates
to other Transports that they may not be very familiar with.

Fabrics is a Transport and any Transport is capable of supporting the
VIRTIO device model. Therefore I don't think the stated aim should be to
match PCI specifically. Just being a Transport is already enough. PCI is
not special.

I suggest something like:

  Virtio Over Fabrics enables operation over interconnects that rely
  primarily on message passing. Supported interconnects include TODO.

> +
> +Virtio Over Fabrics uses reliable connection to transmit data, the reliable

"uses a reliable connection"

> +connection betweens two rules:

"connection facilitates communication between entities playing the following roles:"

> +
> +\begin{itemize}
> +\item An initiator functions as an Virtio Over Fabrics client. An initiator

"as a Virtio ..."

> +typically serves the same purpose to a machine as a Virtio device, issues
> +commands to remote side.

This says that the driver talks to the initiator instead of a local
device and the initiator forwards commands to the actual device on the
remote side?

I find this sentence confusing because I associate the initiator with
the driver, not the device.

Maybe:

  The initiator sends commands from the driver to the target.

> +\item A target functions as an Virtio Over Fabrics server. An target typically

"A target"

> +handles commands from the initiator side and responses completions.

The concept of the device is missing here. For symmetry it may be good
to say something like:

  The target forwards commands to the device and sends responses back to
  the initiator.

> +\end{itemize}
> +
> +Virtio Over Fabrics has the following differences from the PCI based
> +specification:
> +
> +\begin{itemize}
> +\item Instead of memory sharing mechanism of virtqueue, there is a one-to-one
> +mapping between virtqueue and the reliable connection which executes the vring
> +data transmission.
> +\item An additional control connection is required to execute control commands
> +which is similar to read/write register on a PCI device.
> +\item Virtio Over Fabrics does not define an interrupt mechanism that allows an
> +initiator to generate a host interrupt. It is the responsibility of the host
> +fabric interface to generate host interrupts.
> +\end{itemize}

As mentioned above, comparing against PCI requires that the reader is
familiar with PCI. I think it would be preferrable to explain the unique
characteristics of Virtio Over Fabrics in a self-contained way:

  The basic organization of Virtio Over Fabrics is as follows:

  \begin{itemize}
  \item A reliable connection carries control commands that are not specific to a virtqueue.
  \item Each virtqueue has its own reliable connection.
  \item There is no interrupt mechanism since the arrival of data on the fabric already indicates when there is activity.
  \end{itemize}

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name zhenwei pi
@ 2023-05-31 14:06   ` Stefan Hajnoczi
  2023-06-02  1:50     ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-05-31 14:06 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 2638 bytes --]

On Thu, May 04, 2023 at 04:19:01PM +0800, zhenwei pi wrote:
> Add VQN section. The VQN is a little different from iSCSI/NVMe-oF on
> style limitation. Because iSCSI/NVMe-of is storage specific protocol,
> the full string IQN(for iSCSI/iSER) and NQN(for NVMe-oF) represents
> a "storage access address". However, Virtio Over Fabrics works as
> transport layer rather than device layer, a URL style string is better
> to Virtio Over Fabrics. For example:
> virtio-of://blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
> virtio-of://blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> ...
> virtio-of://crypto-resource/25307f22-e5a8-4ea2-b7ca-79f5c3bebc3c

I'm not sure what blk-resource and nvme-pool are in these URLs?

Should the patch mention the virtio-of:// URI scheme?

> 
> A hunam readable VQN is helpful to maintain/debug/distinguish.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>  transport-fabrics.tex | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> index 0dc031b..26b0192 100644
> --- a/transport-fabrics.tex
> +++ b/transport-fabrics.tex
> @@ -29,3 +29,19 @@ \section{Virtio Over Fabrics}\label{sec:Virtio Transport Options / Virtio Over F
>  initiator to generate a host interrupt. It is the responsibility of the host
>  fabric interface to generate host interrupts.
>  \end{itemize}
> +
> +\subsection{Virtio Qualified Name}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Virtio Qualified Name}
> +Virtio Qualified Names (VQNs) are used to uniquely describe an initiator or a
> +target for the purposes of identification.
> +
> +A VQN is encoded as a string of Unicode characters with the following
> +properties:
> +
> +\begin{itemize}
> +\item The encoding is UTF-8 (refer to RFC 3629).
> +\item The characters dash('-'), dot ('.'), slash('/') and colon(':') are used
> +in formatting.
> +\item The maximum name is 256 bytes in length.
> +\item The string is null terminated.

Is the maximum name 255 UTF-8 bytes plus a NUL character? Please state
this in the spec. For example:

  \item The string is NUL terminated.
  \item The maximum name is 256 bytes in length, including the NUL character.

> +\item There is no strict style limitation.

I think it's necessary to define representations for specific fabrics
(e.g. TCP/IP) so that VQNs can be exchanged between different VIRTIO
implementations (VMMs, DPUs, command-line utilities, etc). Otherwise two
different implementations may represent the same address differently.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition zhenwei pi
@ 2023-05-31 14:23   ` Stefan Hajnoczi
  2023-06-02  3:08     ` zhenwei pi
  2023-06-05  2:40   ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-05-31 14:23 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 4310 bytes --]

On Thu, May 04, 2023 at 04:19:02PM +0800, zhenwei pi wrote:
> Introduce segment descriptor to describe the Virtio device buffer
> segments.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>  transport-fabrics.tex | 43 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> index 26b0192..b88acfd 100644
> --- a/transport-fabrics.tex
> +++ b/transport-fabrics.tex
> @@ -45,3 +45,46 @@ \subsection{Virtio Qualified Name}\label{sec:Virtio Transport Options / Virtio O
>  \item The string is null terminated.
>  \item There is no strict style limitation.
>  \end{itemize}
> +
> +\subsection{Transmission Protocol}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol}
> +This section defines transmission protocol for Virtio Over Fabrics. All the

What does "transmission protocol" mean? I guess this is what is often
called a network protocol or a wire protocol or just a protocol, but it
wasn't clear to me maybe whether the "transmission protocol" is one
protocol out of a set of protocols that make up Virtio Over Fabrics.

This paragraph should describe which connections use this protocol. For
example:

  This protocol is used for both control and virtqueue connections.

> +fields use little endian format.
> +
> +\subsubsection{Segment Descriptor Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Segment Descriptor Definition}
> +Virtio Over Fabrics uses the following structure to describe data segment:

What is a data segment? I guess it's a message/command/request?

There should be an explanation of how data segments are used. For
example:

  The initiator sends a data segment containing the command to the
  target. The target sends a data segment containing the response to the
  command back to the initiator.

> +
> +\begin{lstlisting}
> +struct virtio_of_vring_desc {

I think the name "vring" should be avoided. The vring is an in-memory
layout for implementing virtqueues where shared memory is available.
Calling it virtio_of_vq_desc makes it clear that Virtio Over Fabrics
does not use vrings to implement virtqueues.

> +        le64 addr;
> +        le32 length;
> +        /* This marks the unique ID within a command, no limitation among inflight commands */

What is a command?

> +        le16 id;
> +        /* This marks a buffer as keyed transmission (otherwise stream transmission) */
> +#define VIRTIO_OF_DESC_F_KEYED     1
> +        /* This marks a buffer as device write-only (otherwise device read-only). */
> +#define VIRTIO_OF_DESC_F_WRITE     2
> +        le16 flags;
> +        le32 key;
> +};
> +\end{lstlisting}
> +
> +The structure virtio_of_vring_desc is used for both keyed transmission
> +(i.e. RDMA) and stream transmission(i.e. TCP). The fields is described as follows:
> +
> +\begin{tabular}{ |l|l|l| }
> +\hline
> +Field & keyed transmission & stream transmission \\
> +\hline \hline
> +addr & Start address of remote memory buffer & Start address within the stream buffer \\

What is a stream buffer?

> +\hline
> +length & The length of remote memory buffer & The length of buffer within the stream \\

I'm not sure what buffer means here. I guess it's not the same as a
virtqueue buffer, it's probably a virtqueue descriptor (element)?

Can you avoid using buffer here since it usually means something else in
Virtio?

> +\hline
> +id & The ID of this descriptor & The ID of this descriptor \\
> +\hline
> +flags & both keyed transmission and stream transmission supported & stream transmission only \\

I'm not sure what this means.

> +\hline
> +key & Key of the remote Memory Region & Ignore \\

Should "Ignore" be "Reserved" so that stream transmission can use this
field for something else in the future?

> +\hline
> +\end{tabular}
> +
> +Depending on the opcode, a Command contains zero or more structure virtio_of_vring_desc.

opcode hasn't been defined yet. I guess that's because the first
virtio_of_vring_desc contains a Command and that has an opcode field?
Please make sure the text is ordered so that terms are defined before
they are used.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission zhenwei pi
@ 2023-05-31 15:20   ` Stefan Hajnoczi
  2023-06-02  2:26     ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-05-31 15:20 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 6590 bytes --]

On Thu, May 04, 2023 at 04:19:03PM +0800, zhenwei pi wrote:
> Stream transmission is used for stream oriented communication(Ex TCP),
> also add virtio-blk read/write 8K example.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>  transport-fabrics.tex | 229 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 229 insertions(+)
> 
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> index b88acfd..c02cf26 100644
> --- a/transport-fabrics.tex
> +++ b/transport-fabrics.tex
> @@ -88,3 +88,232 @@ \subsubsection{Segment Descriptor Definition}\label{sec:Virtio Transport Options
>  \end{tabular}
>  
>  Depending on the opcode, a Command contains zero or more structure virtio_of_vring_desc.
> +
> +\subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Buffer Mapping Definition}
> +Virtio Over Fabrics defines two types of buffer mapping rules.

What is a buffer? Is it a virtqueue buffer (consisting of one or more
descriptors/elements) or are you using the term for a different concept?

> +
> +\paragraph{Stream Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
> +Command, Segment Descriptors, and buffer are transmitted in a stream within a

Is a Segment Descriptor a virtio_of_vring_desc?

> +connection. The layout in stream:
> +
> +\begin{lstlisting}
> +CMDx contains 0 descriptor, CMDy contains (n - m + 1) descriptors and buffer:

"0 descriptors"

> +
> +     +-----+     +-----++-----+     +-----++-----+
> + ... | CMDx| ... | CMDy||DESCm| ... |DESCn|| BUF | ...
> +     +-----+     +-----++-----+     +-----++-----+
> +
> +COMPx contains 0 descriptor, COMPy contains (k - j + 1) descriptors and buffer:

I think this is the first time the concept of a completion (COMP) was
introduced. Please describe commands/completions before using them in
the text.

> +
> +     +-----+     +-----++-----+     +-----++-----+
> + ... |COMPx| ... |COMPy||DESCj| ... |DESCk|| BUF | ...
> +     +-----+     +-----++-----+     +-----++-----+
> +\end{lstlisting}
> +
> +An example of a virtio-blk write 8K request(total size: sizeof(Command) +
> +4 * sizeof(Descriptor) + 8208):
> +\begin{lstlisting}
> + COMMAND            +------+
> +                    |opcode|  ->  virtio_of_op_vring
> +                    +------+
> +                    |cmd id|  ->  10
> +                    +------+
> +                    |length|  ->  8208
> +                    +------+
> +                    |ndesc |  ->  4
> +                    +------+
> +                    |rsvd  |
> +                    +------+
> +
> + DESC0              +------+
> +              +-----|addr  |  -> 0
> +              |     +------+
> +              |     |length|  -> 16 (virtio blk write command)
> +              |     +------+
> +              |     |id    |  -> 0
> +              |     +------+
> +              |     |flags |  -> 0
> +              |     +------+
> +              |
> + DESC1        |     +------+
> +              | +---|addr  |  -> 16
> +              | |   +------+
> +              | |   |length|  -> 4096
> +              | |   +------+
> +              | |   |id    |  -> 1
> +              | |   +------+
> +              | |   |flags |  -> 0
> +              | |   +------+
> +              | |
> + DESC2        | |   +------+
> +              | | +-|addr  |  -> 4112
> +              | | | +------+
> +              | | | |length|  -> 4096
> +              | | | +------+
> +              | | | |id    |  -> 2
> +              | | | +------+
> +              | | | |flags |  -> 0
> +              | | | +------+
> +              | | |
> + DESC3        | | | +------+
> +              | | | |addr  |  -> 0

Is this field 0 in all stream connection VIRTIO_OF_DESC_F_WRITE
descriptors?

> +              | | | +------+
> +              | | | |length|  -> 1
> +              | | | +------+
> +              | | | |id    |  -> 3
> +              | | | +------+
> +              | | | |flags |  -> VIRTIO_OF_DESC_F_WRITE
> +              | | | +------+
> +              | | |
> + DATA         +-+-+>+------+  -> 0
> +                | | |......|
> +                +-+>+------+  -> 16
> +                  | |......|
> +                  +>+------+  -> 4112
> +                    |......|
> +                    +------+  -> 8208
> +\end{lstlisting}
> +
> +The Completion of this request(total size: sizeof(Completion) +
> +1 * sizeof(Descriptor) + 1):
> +\begin{lstlisting}
> + COMPLETION         +------+
> +                    |status|  ->  VIRTIO_OF_SUCCESS
> +                    +------+
> +                    |cmd id|  ->  10
> +                    +------+
> +                    |ndesc |  ->  1
> +                    +------+
> +                    |rsvd  |
> +                    +------+
> +                    |value |  -> 1 (value.u32)

What is this field and what does u32 mean?

> +                    +------+
> +
> + DESC0              +------+
> +                  +-|addr  |  -> 0
> +                  | +------+
> +                  | |length|  -> 1
> +                  | +------+
> +                  | |id    |  -> 3

This has to match with the original descriptor id sent with the Command?

> +                  | +------+
> +                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
> +                  | +------+
> +                  |
> + DATA             |>+------+  -> 0
> +                    |......|
> +                    +------+  -> 1
> +\end{lstlisting}

I think this is more flexible (and has more protocol overhead) than
necessary. When the device has used a virtqueue buffer, it indicates how
many bytes were used (this can be less than the totaly number of F_WRITE
bytes available). I don't think there is a need to communicate F_WRITE
descriptors, especially in the Completion. Just a Completion with a
'length' field instead of an 'ndesc' field followed by data is enough.

Since VIRTIO has flexible framing
(https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-390004),
there isn't really a need to communicate the F_WRITE descriptors at all,
just the maximum number of used bytes that the initiator expects.

Can you explain why you chose to transmit F_WRITE descriptors in both
the Command and the Completion? Maybe I missed a reason why it's
important.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission zhenwei pi
@ 2023-05-31 16:20   ` Stefan Hajnoczi
  2023-06-01  9:02     ` zhenwei pi
  2023-06-05  2:41   ` Parav Pandit
  1 sibling, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-05-31 16:20 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 1664 bytes --]

On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
> Keyed transmission is used for message oriented communication(Ex RDMA),
> also add virtio-blk read/write 8K example.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>  transport-fabrics.tex | 178 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 178 insertions(+)
> 
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> index c02cf26..7711321 100644
> --- a/transport-fabrics.tex
> +++ b/transport-fabrics.tex
> @@ -317,3 +317,181 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
>                      |......|
>                      +------+  -> 8193
>  \end{lstlisting}
> +
> +\paragraph{Keyed Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
> +Command and Segment Descriptors are transmitted in a message within a
> +connection, and buffer is transmitted by remote memory access.  The layout in message:

With RDMA it is theoretically possible to implement virtqueues without
messages in the data path (i.e. by using something similar to vring with
RDMA). Why did you decide to use a mixed messages + RDMA approach
instead of a 100% RDMA approach?

> +
> +\begin{lstlisting}
> +CMDx contains 0 descriptor, CMDy contains (n - m + 1) descriptors:

"0 descriptors"

> +
> +     +-----+     +-----++-----+     +-----+
> + ... | CMDx| ... | CMDy||DESCm| ... |DESCn| ...
> +     +-----+     +-----++-----+     +-----+
> +
> +COMPx contains 0 descriptor, COMPy contains 0 descriptor:

"0 descriptors"

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 06/11] transport-fabrics: introduce command set zhenwei pi
@ 2023-05-31 17:10   ` Stefan Hajnoczi
  2023-06-02  5:15     ` [virtio-comment] " zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-05-31 17:10 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 12385 bytes --]

On Thu, May 04, 2023 at 04:19:05PM +0800, zhenwei pi wrote:
> Introduce command structures for Virtio-oF.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>  transport-fabrics.tex | 209 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 209 insertions(+)
> 
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> index 7711321..37f57c6 100644
> --- a/transport-fabrics.tex
> +++ b/transport-fabrics.tex
> @@ -495,3 +495,212 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
>                      |value |  -> 8193 (value.u32)
>                      +------+
>  \end{lstlisting}
> +
> +\subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
> +This section defines command structures for Virtio Over Fabrics.
> +
> +A common structure virtio_of_value is fixed to 8 bytes and MUST be used as one
> +of the following format:
> +
> +\begin{itemize}
> +\item u8
> +\item le16
> +\item le32
> +\item le64
> +\end{itemize}

The way it's written does not document where the u8, u16, u32 bytes are
located and that the unused bytes are 0. I think I understand what you
mean though:

  le64 value = cpu_to_le64((u64)v); /* v is u8, u16, u32, or u64 */

Please clarify.

> +
> +\paragraph{Command ID}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Command ID}
> +There is command_id(le16) field in each Command and Completion:

"is a command_id"

> +
> +\begin{itemize}
> +\item Generally the initiator allocates a Command ID and specifies the

"allocates a Command ID that is unique for all in-flight commands"?

> +command_id field of a Command, and the target MUST specify the same Command ID

The "MUST" statement needs to be in a driver-normative section. You can
keep the sentence in this non-normative section by tweaking it:
"target specifies"

The idea is that all MUST/SHOULD/etc statements are in a separate
device/driver-normative section so that they can be easily reviewed by
device/driver implementers without re-reading the entire text.

> +in command_id field of Completion.
> +\item The initiator MUST guarantee each Command ID is unique in the inflight Commands.

Same here about "MUST".

> +\item Command ID 0xff00 - 0xffff is reserved for control queue to delivery asynchronous event.

"for control queue asynchronous events"

> +\end{itemize}
> +
> +The reserved Command ID for control queue is defined as follows:

"The reserved Command IDs for the control queue are as follows:"

> +
> +\begin{tabular}{ |l|l| }
> +\hline
> +Command ID & Description \\
> +\hline \hline
> +0xffff & Keepalive. The initiator SHOULD ignore this event \\

"Ignored by the initiator." + move the SHOULD statement to a
driver-normative section.

> +\hline
> +0xfffe & Config change. The initiator SHOULD generate config change interrupt to device \\

"Causes the initiator to generate a configuration change notification."

> +\hline
> +0xff00 - 0xfffd & Reserved \\
> +\hline
> +\end{tabular}
> +
> +\paragraph{Connect Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
> +The Connect Command is used to establish Virtio Over Fabrics queue. The control
> +queue MUST be established firstly, then the Connect command establishes an
> +association between the initiator and the target.

Is a "Virtio Over Fabrics queue" different from a virtqueue?

If I understand correctly, the control queue must be established by the
initiator first and then the Connect command is sent to begin
communication between the initiator and the target?

> +
> +The Target ID of 0xffff is reserved, then:

Please move this after the fields have been shown and the purpose of the
Target ID field has been explained.

> +\begin{itemize}
> +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
> +Command for the control queue.
> +\item The target SHOULD allocate any available Target ID to the initiator,
> +and return the allocated Target ID in the Completion.
> +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
> +MUST be specified in a Connect Command for the virtqueue.
> +\end{itemize}

What is the purpose of the Target ID? Is it to allow a server to provide
access to multiple targets over the same connection?

> +
> +The Connect Command has following structure:
> +
> +\begin{lstlisting}
> +struct virtio_of_command_connect {
> +        le16 opcode;
> +        le16 command_id;
> +        le16 target_id;
> +        le16 queue_id;
> +        le16 ndesc;

Where is this field documented?

Why does the initiator send ndesc to the target? Normally a VIRTIO Transport reports the device's max descriptors and then the driver can tell the device to reduce the number of descriptors, if desired.

> +#define VIRTIO_OF_CONNECTION_TCP     1
> +#define VIRTIO_OF_CONNECTION_RDMA    2

What does RDMA mean? I thought RDMA is a general concept that several
fabrics implement (with different details like how addressing works).

> +        u8 oftype;
> +        u8 padding[5];
> +};
> +\end{lstlisting}
> +
> +The Connect commands MUST contains one Segment Descriptor and one structure
> +virtio_of_command_connect to specify Initiator VQN and Target VNQ,
> +virtio_of_command_connect has following structure:

I'm confsued. virtio_of_command_connect was defined above. The struct
defined below is virtio_of_connect. Does this paragraph need to be
updated (virtio_of_command_connect -> virtio_of_connect)?

Why is virtio_of_connect a separate struct and not part of
virtio_of_command_connect?

> +
> +\begin{lstlisting}
> +struct virtio_of_connect {
> +        u8 ivqn[256];
> +        u8 tvqn[256];

If the initiator is already sends tvqn, why also have target_id?

> +        u8 padding[512];
> +};
> +\end{lstlisting}
> +
> +\paragraph{Feature Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command}
> +
> +The control queue uses Feature Command to get or set features. This command is used for:
> +
> +\begin{itemize}
> +\item The initiator/target features. This is used to negotiate transport layer features.
> +\item The driver/device features. This is used to negotiate Virtio Based device
> +features which is similar to PCI based device.

Please do not make references to the PCI Transport.

> +\end{itemize}
> +
> +The Feature Command has following structure:
> +
> +\begin{lstlisting}
> +struct virtio_of_command_feature {
> +        le16 opcode;
> +        le16 command_id;
> +        le32 feature_select;
> +        le64 value;        /* ignore this field on GET */
> +};
> +\end{lstlisting}

I guess the opcode tells the target whether this is a VIRTIO Features
Get, VIRTIO Features Set, VIRTIO-Over-Fabrics Features Get, or
VIRTIO-Over-Fabrics Features Set command? Please document the opcodes
here and also include a full opcode table somewhere else.

> +
> +\paragraph{Queue Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Queue Command}
> +
> +The control queue uses Queue Command to get or set properties on a specific queue.
> +The Queue Command has following structure:
> +
> +\begin{lstlisting}
> +struct virtio_of_command_queue {
> +        le16 opcode;
> +        le16 command_id;
> +        le16 queue_id;

Does "queue" mean virtqueue here? Or does it also apply to the control
queue? If it's a virtqueue, please call this vq_id.

> +        u8 padding6;
> +        u8 padding7;
> +        struct virtio_of_value value;   /* ignore this field on GET */
> +};
> +\end{lstlisting}

The opcode and their semantics are not documented.

> +\paragraph{Config Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Config Command}
> +
> +The control queue uses Config Command to get or set configure on device.
> +The Config Command has following structure:

I suggest choosing a different name to avoid confusion with the
VIRTIO Configuration Space.

> +
> +\begin{lstlisting}
> +struct virtio_of_command_config {
> +        le16 opcode;
> +        le16 command_id;
> +        le16 offset;
> +        u8 bytes;
> +        u8 padding7;
> +        struct virtio_of_value value;        /* ignore this field on GET */
> +};
> +\end{lstlisting}
> +
> +The bytes field supports on Get only:
> +
> +\begin{itemize}
> +\item 1, then the initiator reads from value field of Completion as u8
> +\item 2, then the initiator reads from value field of Completion as le16
> +\item 4, then the initiator reads from value field of Completion as le32
> +\item 8, then the initiator reads from value field of Completion as le64
> +\end{itemize}
> +
> +The bytes field supports on Set only:
> +
> +\begin{itemize}
> +\item 1, then the initiator specifies the value field of Config Command as u8
> +\item 2, then the initiator specifies the value field of Config Command as le16
> +\item 4, then the initiator specifies the value field of Config Command as le32
> +\item 8, then the initiator specifies the value field of Config Command as le64
> +\end{itemize}

I have no idea what virtio_of_command_config does because the opcodes
aren't documented.

> +
> +\paragraph{Common Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command}
> +
> +The control queue uses Common Command to get or set common properties on
> +device(i.e. get device ID). The Common Command has following structure:
> +
> +\begin{lstlisting}
> +struct virtio_of_command_common {
> +        le16 opcode;
> +        le16 command_id;
> +        u8 padding4;
> +        u8 padding5;
> +        u8 padding6;
> +        u8 padding7;
> +        struct virtio_of_value value;        /* ignore this field on GET */
> +};
> +\end{lstlisting}
> +
> +
> +\paragraph{Vring Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Vring Command}
> +
> +Both control queue and virtqueue use Vring Command to transmit buffer.
> +The Vring Command has following structure:
> +
> +\begin{lstlisting}
> +struct virtio_of_command_vring {
> +        le16 opcode;
> +        le16 command_id;
> +        /* Total buffer size this command contains(not include command&descriptors). */
> +        le32 length;
> +        /* How many descriptors this command contains */
> +        le16 ndesc;
> +        u8 padding[6];
> +};
> +\end{lstlisting}
> +
> +\paragraph{Completion}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Completion}
> +
> +The target responses Completion to the initiator to report command status,
> +device properties, and transmit buffer. The Completion has following structure:
> +
> +\begin{lstlisting}
> +struct virtio_of_completion {
> +        le16 status;
> +        le16 command_id;
> +        /* How many descriptors this completion contains */
> +        le16 ndesc;
> +        u8 rsvd6;
> +        u8 rsvd7;
> +        struct virtio_of_value value;
> +};
> +\end{lstlisting}
> +
> +Note that Virtio Over Fabrics does not define an interrupt mechanism, generally
> +the initiator receives a Completion, it SHOULD generate a host interrupt
> +(if no interrupt suspending on device).

It's not possible to review this patch because these structs aren't used
yet and the opcodes are undefined.

Defining structs that are shared by multiple opcodes might make
implementations cleaner, but I think it makes the spec less clear. I
would rather have a list of all opcodes and each one shows the full
command layout (even if it is duplicated). That way it's very easy to
look up an opcode you are implementing or debugging and check what's
needed. If the command layout is not documented in a single place, then
it takes more effort to figure out how an opcode works.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: [PATCH v2 07/11] transport-fabrics: introduce opcodes
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 07/11] transport-fabrics: introduce opcodes zhenwei pi
@ 2023-05-31 17:11   ` Stefan Hajnoczi
       [not found]   ` <20230531205508.GA1509630@fedora>
  1 sibling, 0 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-05-31 17:11 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 577 bytes --]

On Thu, May 04, 2023 at 04:19:06PM +0800, zhenwei pi wrote:
> Define opcode with this rule:
> The Virtio-oF transport layer commands use 0x0000-0x0fff, and the
> device layer commands use 0x1000-0xffff. get/set status/feature/
> config use consecutive number.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>  transport-fabrics.tex | 134 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 134 insertions(+)

I will continue reviewing this later because I have run out of time.
Feel free to iterate this series in the meantime.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-05-31 16:20   ` [virtio-comment] " Stefan Hajnoczi
@ 2023-06-01  9:02     ` zhenwei pi
  2023-06-01 11:33       ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-01  9:02 UTC (permalink / raw)
  To: virtio-comment



On 6/1/23 00:20, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
>> Keyed transmission is used for message oriented communication(Ex RDMA),
>> also add virtio-blk read/write 8K example.
>>
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   transport-fabrics.tex | 178 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 178 insertions(+)
>>
>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>> index c02cf26..7711321 100644
>> --- a/transport-fabrics.tex
>> +++ b/transport-fabrics.tex
>> @@ -317,3 +317,181 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
>>                       |......|
>>                       +------+  -> 8193
>>   \end{lstlisting}
>> +
>> +\paragraph{Keyed Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
>> +Command and Segment Descriptors are transmitted in a message within a
>> +connection, and buffer is transmitted by remote memory access.  The layout in message:
> 
> With RDMA it is theoretically possible to implement virtqueues without
> messages in the data path (i.e. by using something similar to vring with
> RDMA). Why did you decide to use a mixed messages + RDMA approach
> instead of a 100% RDMA approach?
> 

Hi,

To reduce networking RTT. From my experience, a single RDMA 
message(event based) uses at least 6us.
This approach has a chance to send a command(include data segments) by 1 
networking RTT, and receive a completion(include data segments) in 1 
networking RTT. I tried to design a 100% RDMA approach(mapping a vring 
to the remote side, the remote side accesses this vring by RDMA 
READ/WRITE), but I failed to find an idea to achieve.

>> +
>> +\begin{lstlisting}
>> +CMDx contains 0 descriptor, CMDy contains (n - m + 1) descriptors:
> 
> "0 descriptors"
> 
>> +
>> +     +-----+     +-----++-----+     +-----+
>> + ... | CMDx| ... | CMDy||DESCm| ... |DESCn| ...
>> +     +-----+     +-----++-----+     +-----+
>> +
>> +COMPx contains 0 descriptor, COMPy contains 0 descriptor:
> 
> "0 descriptors"

OK, I'll fix this in the next series. Thanks!

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-01  9:02     ` zhenwei pi
@ 2023-06-01 11:33       ` Stefan Hajnoczi
  2023-06-01 13:09         ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-01 11:33 UTC (permalink / raw)
  To: zhenwei pi; +Cc: virtio-comment

[-- Attachment #1: Type: text/plain, Size: 3848 bytes --]

On Thu, Jun 01, 2023 at 05:02:45PM +0800, zhenwei pi wrote:
> 
> 
> On 6/1/23 00:20, Stefan Hajnoczi wrote:
> > On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
> > > Keyed transmission is used for message oriented communication(Ex RDMA),
> > > also add virtio-blk read/write 8K example.
> > > 
> > > Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> > > ---
> > >   transport-fabrics.tex | 178 ++++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 178 insertions(+)
> > > 
> > > diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> > > index c02cf26..7711321 100644
> > > --- a/transport-fabrics.tex
> > > +++ b/transport-fabrics.tex
> > > @@ -317,3 +317,181 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
> > >                       |......|
> > >                       +------+  -> 8193
> > >   \end{lstlisting}
> > > +
> > > +\paragraph{Keyed Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
> > > +Command and Segment Descriptors are transmitted in a message within a
> > > +connection, and buffer is transmitted by remote memory access.  The layout in message:
> > 
> > With RDMA it is theoretically possible to implement virtqueues without
> > messages in the data path (i.e. by using something similar to vring with
> > RDMA). Why did you decide to use a mixed messages + RDMA approach
> > instead of a 100% RDMA approach?
> > 
> 
> Hi,
> 
> To reduce networking RTT. From my experience, a single RDMA message(event
> based) uses at least 6us.
> This approach has a chance to send a command(include data segments) by 1
> networking RTT, and receive a completion(include data segments) in 1
> networking RTT. I tried to design a 100% RDMA approach(mapping a vring to
> the remote side, the remote side accesses this vring by RDMA READ/WRITE),
> but I failed to find an idea to achieve.

The goal is to minimize the number of RDMA transfers. Each area of
memory should be located on the system that is polling constantly (busy
waiting) and the other side occassionally sends an RDMA WRITE request.

This idea requires bi-directional RDMA where both initiator and target
make memory accessible to the other side. Is this possible?

The target owns the Available Ring, a descriptor table similar to those
used by the Split and Packed Virtqueue layouts that is used by the
driver to submit virtqueue buffers to the device. The target sends a key
to the Available Ring to the initiator during virtqueue setup. The
initiator sends RDMA WRITEs that fill in virtqueue descriptors. Indirect
descriptors are supported, but the target will need to use RDMA READs to
load the indirect descriptor table, so there is overhead. Even regular
non-indirect descriptors have overhead because an RDMA READ is required
to read the payload. The best approach for small virtqueue elements is
to inline the payload in the Available Ring descriptor so no additional
RDMA transfers are needed (this achieves similar effect to your approach
of using messages + RDMA, but with pure RDMA). The target polls the
Available Ring to detect available buffers.

The initiator sends a key to the Used Ring to the target during
virtqueue setup. The target sends RDMA WRITEs that fill in used
elements. The initiator polls the Used Ring to detect used buffers.

I'm not sure if the Used Ring makes sense as RDMA memory. Maybe it's
better to send a message over the reliable connection instead so that
Used Buffer Notifications can support interrupts and not just polling.

This is a new virtqueue layout. It's only worthwhile implementing it if
the Available Ring RDMA performance is significantly better than the
current approach.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-01 11:33       ` Stefan Hajnoczi
@ 2023-06-01 13:09         ` zhenwei pi
  2023-06-01 19:13           ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-01 13:09 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-comment



On 6/1/23 19:33, Stefan Hajnoczi wrote:
> On Thu, Jun 01, 2023 at 05:02:45PM +0800, zhenwei pi wrote:
>>
>>
>> On 6/1/23 00:20, Stefan Hajnoczi wrote:
>>> On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
>>>> Keyed transmission is used for message oriented communication(Ex RDMA),
>>>> also add virtio-blk read/write 8K example.
>>>>
>>>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>>>> ---
>>>>    transport-fabrics.tex | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 178 insertions(+)
>>>>
>>>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>>>> index c02cf26..7711321 100644
>>>> --- a/transport-fabrics.tex
>>>> +++ b/transport-fabrics.tex
>>>> @@ -317,3 +317,181 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
>>>>                        |......|
>>>>                        +------+  -> 8193
>>>>    \end{lstlisting}
>>>> +
>>>> +\paragraph{Keyed Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
>>>> +Command and Segment Descriptors are transmitted in a message within a
>>>> +connection, and buffer is transmitted by remote memory access.  The layout in message:
>>>
>>> With RDMA it is theoretically possible to implement virtqueues without
>>> messages in the data path (i.e. by using something similar to vring with
>>> RDMA). Why did you decide to use a mixed messages + RDMA approach
>>> instead of a 100% RDMA approach?
>>>
>>
>> Hi,
>>
>> To reduce networking RTT. From my experience, a single RDMA message(event
>> based) uses at least 6us.
>> This approach has a chance to send a command(include data segments) by 1
>> networking RTT, and receive a completion(include data segments) in 1
>> networking RTT. I tried to design a 100% RDMA approach(mapping a vring to
>> the remote side, the remote side accesses this vring by RDMA READ/WRITE),
>> but I failed to find an idea to achieve.
> 
> The goal is to minimize the number of RDMA transfers. Each area of
> memory should be located on the system that is polling constantly (busy
> waiting) and the other side occassionally sends an RDMA WRITE request.
> 
> This idea requires bi-directional RDMA where both initiator and target
> make memory accessible to the other side. Is this possible?
> 
> The target owns the Available Ring, a descriptor table similar to those
> used by the Split and Packed Virtqueue layouts that is used by the
> driver to submit virtqueue buffers to the device. The target sends a key
> to the Available Ring to the initiator during virtqueue setup. The
> initiator sends RDMA WRITEs that fill in virtqueue descriptors. Indirect
> descriptors are supported, but the target will need to use RDMA READs to
> load the indirect descriptor table, so there is overhead. Even regular
> non-indirect descriptors have overhead because an RDMA READ is required
> to read the payload. The best approach for small virtqueue elements is
> to inline the payload in the Available Ring descriptor so no additional
> RDMA transfers are needed (this achieves similar effect to your approach
> of using messages + RDMA, but with pure RDMA). The target polls the
> Available Ring to detect available buffers.
> 
> The initiator sends a key to the Used Ring to the target during
> virtqueue setup. The target sends RDMA WRITEs that fill in used
> elements. The initiator polls the Used Ring to detect used buffers.
> 
> I'm not sure if the Used Ring makes sense as RDMA memory. Maybe it's
> better to send a message over the reliable connection instead so that
> Used Buffer Notifications can support interrupts and not just polling.
> 

I guess RDMA WRITE WITH IMM would be fine for this approach.

> This is a new virtqueue layout. It's only worthwhile implementing it if
> the Available Ring RDMA performance is significantly better than the
> current approach.
> 
> Stefan

I agree with your approach to maintain the Vring. If I understand correctly:
an example of virtio-blk write 4k:
1, initiator write the 3 vring desc by RDMA WRITE WITH IMM(IMM Data to 
carry VQ control message), this uses 1 networking RTT.
2, target handles WRITE WITH IMM, reads remote memory from initiator of 
desc[0] and desc[1]. This uses 1 networking RTT. (I did not find the 2 
keys of desc[0] and desc[1] from your approach, but I think this can be 
implemented in step 1 by adding another memory)
3, target handles virtio-blk write request and writes the memory to 
initiator of desc[2] by RDMA WRITE WITH IMM.(IMM Data to carry control 
message). This uses 1 networking RTT.


So we use at lease 3 RTT by this approach. If unfortunately the u32 
imm_data is lack to carry enough control message, we may need more RTT.

Sorry, the previous "I failed to find an idea to achieve." means that I 
failed to find an idea to complete 1 single request in 2 RTT.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-01 13:09         ` zhenwei pi
@ 2023-06-01 19:13           ` Stefan Hajnoczi
  2023-06-01 21:23             ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-01 19:13 UTC (permalink / raw)
  To: zhenwei pi; +Cc: virtio-comment

[-- Attachment #1: Type: text/plain, Size: 7242 bytes --]

On Thu, Jun 01, 2023 at 09:09:49PM +0800, zhenwei pi wrote:
> 
> 
> On 6/1/23 19:33, Stefan Hajnoczi wrote:
> > On Thu, Jun 01, 2023 at 05:02:45PM +0800, zhenwei pi wrote:
> > > 
> > > 
> > > On 6/1/23 00:20, Stefan Hajnoczi wrote:
> > > > On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
> > > > > Keyed transmission is used for message oriented communication(Ex RDMA),
> > > > > also add virtio-blk read/write 8K example.
> > > > > 
> > > > > Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> > > > > ---
> > > > >    transport-fabrics.tex | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > >    1 file changed, 178 insertions(+)
> > > > > 
> > > > > diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> > > > > index c02cf26..7711321 100644
> > > > > --- a/transport-fabrics.tex
> > > > > +++ b/transport-fabrics.tex
> > > > > @@ -317,3 +317,181 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
> > > > >                        |......|
> > > > >                        +------+  -> 8193
> > > > >    \end{lstlisting}
> > > > > +
> > > > > +\paragraph{Keyed Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
> > > > > +Command and Segment Descriptors are transmitted in a message within a
> > > > > +connection, and buffer is transmitted by remote memory access.  The layout in message:
> > > > 
> > > > With RDMA it is theoretically possible to implement virtqueues without
> > > > messages in the data path (i.e. by using something similar to vring with
> > > > RDMA). Why did you decide to use a mixed messages + RDMA approach
> > > > instead of a 100% RDMA approach?
> > > > 
> > > 
> > > Hi,
> > > 
> > > To reduce networking RTT. From my experience, a single RDMA message(event
> > > based) uses at least 6us.

What is the cost of 1 8KB RDMA WRITE vs 2 4KB RDMA WRITES?

I'm asking because if 6us is per RDMA transfer, then it's better to
avoid exposing scatter-gather lists (descriptors) to the other side and
instead provide contiguous memory and accept the cost of memcpy on the
receiving side.

On the other hand, if the cost is mostly determined by the amount of
data transferred, then it's better to expose scatter-gather lists so
data is received in the final memory location where it is consumed.

> > > This approach has a chance to send a command(include data segments) by 1
> > > networking RTT, and receive a completion(include data segments) in 1
> > > networking RTT. I tried to design a 100% RDMA approach(mapping a vring to
> > > the remote side, the remote side accesses this vring by RDMA READ/WRITE),
> > > but I failed to find an idea to achieve.
> > 
> > The goal is to minimize the number of RDMA transfers. Each area of
> > memory should be located on the system that is polling constantly (busy
> > waiting) and the other side occassionally sends an RDMA WRITE request.
> > 
> > This idea requires bi-directional RDMA where both initiator and target
> > make memory accessible to the other side. Is this possible?
> > 
> > The target owns the Available Ring, a descriptor table similar to those
> > used by the Split and Packed Virtqueue layouts that is used by the
> > driver to submit virtqueue buffers to the device. The target sends a key
> > to the Available Ring to the initiator during virtqueue setup. The
> > initiator sends RDMA WRITEs that fill in virtqueue descriptors. Indirect
> > descriptors are supported, but the target will need to use RDMA READs to
> > load the indirect descriptor table, so there is overhead. Even regular
> > non-indirect descriptors have overhead because an RDMA READ is required
> > to read the payload. The best approach for small virtqueue elements is
> > to inline the payload in the Available Ring descriptor so no additional
> > RDMA transfers are needed (this achieves similar effect to your approach
> > of using messages + RDMA, but with pure RDMA). The target polls the
> > Available Ring to detect available buffers.
> > 
> > The initiator sends a key to the Used Ring to the target during
> > virtqueue setup. The target sends RDMA WRITEs that fill in used
> > elements. The initiator polls the Used Ring to detect used buffers.
> > 
> > I'm not sure if the Used Ring makes sense as RDMA memory. Maybe it's
> > better to send a message over the reliable connection instead so that
> > Used Buffer Notifications can support interrupts and not just polling.
> > 
> 
> I guess RDMA WRITE WITH IMM would be fine for this approach.
> 
> > This is a new virtqueue layout. It's only worthwhile implementing it if
> > the Available Ring RDMA performance is significantly better than the
> > current approach.
> > 
> > Stefan
> 
> I agree with your approach to maintain the Vring. If I understand correctly:
> an example of virtio-blk write 4k:
> 1, initiator write the 3 vring desc by RDMA WRITE WITH IMM(IMM Data to carry
> VQ control message), this uses 1 networking RTT.
> 2, target handles WRITE WITH IMM, reads remote memory from initiator of
> desc[0] and desc[1]. This uses 1 networking RTT. (I did not find the 2 keys
> of desc[0] and desc[1] from your approach, but I think this can be
> implemented in step 1 by adding another memory)
> 3, target handles virtio-blk write request and writes the memory to
> initiator of desc[2] by RDMA WRITE WITH IMM.(IMM Data to carry control
> message). This uses 1 networking RTT.
> 
> 
> So we use at lease 3 RTT by this approach. If unfortunately the u32 imm_data
> is lack to carry enough control message, we may need more RTT.
> 
> Sorry, the previous "I failed to find an idea to achieve." means that I
> failed to find an idea to complete 1 single request in 2 RTT.

1 RDMA WRITE WITH IMM for the available buffer + 1 RDMA WRITE WITH IMM
for the used buffer is theoretically possible when all virtqueue
buffer elements are inlined. This way Step 2 can be eliminated.

In theory it's possible to supply multiple available buffers in 1 RDMA
WRITE WITH IMM and complete multiple used buffers in 1 RDMA WRITE WITH
IMM when the virtqueue access pattern allows batching. An optimal RDMA
virtqueue protocol has a 1 RDMA WRITE WITH IMM to N virtqueue buffer
relationship, not a 1:1 relationship.

One more idea to play with: VIRTIO has flexible message framing, so
devices must process a virtqueue buffer the same regardless of whether
it has 1 large element or many small elements. Therefore the virtqueue
RDMA protocol does not need to preserve the virtqueue element count and
sizes from the driver. For example, the target can offer a list of
key/length pairs that the initiator RDMA WRITES the virtqueue buffer
contents into. For a virtio-blk device that would be a struct
virtio_blk_outhdr followed by a large page-aligned buffer for the I/O
buffer data to be transferred. Then the device always a properly aligned
and contiguous buffer. Unfortunately this approach breaks down when the
virtqueue carries requests that are organized very differently, but it
might be useful when there is a most common request type.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-01 19:13           ` Stefan Hajnoczi
@ 2023-06-01 21:23             ` Stefan Hajnoczi
  2023-06-02  0:55               ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-01 21:23 UTC (permalink / raw)
  To: zhenwei pi; +Cc: virtio-comment

[-- Attachment #1: Type: text/plain, Size: 2101 bytes --]

On Thu, Jun 01, 2023 at 03:13:53PM -0400, Stefan Hajnoczi wrote:
> On Thu, Jun 01, 2023 at 09:09:49PM +0800, zhenwei pi wrote:
> > On 6/1/23 19:33, Stefan Hajnoczi wrote:
> > > On Thu, Jun 01, 2023 at 05:02:45PM +0800, zhenwei pi wrote:
> > > > On 6/1/23 00:20, Stefan Hajnoczi wrote:
> > > > > On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
> One more idea to play with: VIRTIO has flexible message framing, so
> devices must process a virtqueue buffer the same regardless of whether
> it has 1 large element or many small elements. Therefore the virtqueue
> RDMA protocol does not need to preserve the virtqueue element count and
> sizes from the driver. For example, the target can offer a list of
> key/length pairs that the initiator RDMA WRITES the virtqueue buffer
> contents into. For a virtio-blk device that would be a struct
> virtio_blk_outhdr followed by a large page-aligned buffer for the I/O
> buffer data to be transferred. Then the device always a properly aligned
> and contiguous buffer. Unfortunately this approach breaks down when the
> virtqueue carries requests that are organized very differently, but it
> might be useful when there is a most common request type.

I'm not sure if I explained this well. What I'm trying to say is that I
think RDMA benefits when the receiver's memory constraints are visible
to the sender. The sender performs RDMA WRITEs to the locations where
the receiver can efficiently process the data.

This protocol proposal doesn't really take advantage of this approach
because it communicates the virtqueue buffer elements from the initiator
(the sender) to the target (the receiver). That's the wrong way around.

I have never used RDMA myself, so this might be wrong, but as long as
the RDMA API allows the sender to specify a scatter-gather list as
input, then I think the details of the virtqueue buffer elements that
don't have the WRITE flag should never be communicated over the network.
Instead the initiator should RDMA WRITE from the VIRTIO driver's
scatter-gather list to the target's preferred destination instead.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-01 21:23             ` Stefan Hajnoczi
@ 2023-06-02  0:55               ` zhenwei pi
  2023-06-05 17:21                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  0:55 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-comment



On 6/2/23 05:23, Stefan Hajnoczi wrote:
> On Thu, Jun 01, 2023 at 03:13:53PM -0400, Stefan Hajnoczi wrote:
>> On Thu, Jun 01, 2023 at 09:09:49PM +0800, zhenwei pi wrote:
>>> On 6/1/23 19:33, Stefan Hajnoczi wrote:
>>>> On Thu, Jun 01, 2023 at 05:02:45PM +0800, zhenwei pi wrote:
>>>>> On 6/1/23 00:20, Stefan Hajnoczi wrote:
>>>>>> On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
>> One more idea to play with: VIRTIO has flexible message framing, so
>> devices must process a virtqueue buffer the same regardless of whether
>> it has 1 large element or many small elements. Therefore the virtqueue
>> RDMA protocol does not need to preserve the virtqueue element count and
>> sizes from the driver. For example, the target can offer a list of
>> key/length pairs that the initiator RDMA WRITES the virtqueue buffer
>> contents into. For a virtio-blk device that would be a struct
>> virtio_blk_outhdr followed by a large page-aligned buffer for the I/O
>> buffer data to be transferred. Then the device always a properly aligned
>> and contiguous buffer. Unfortunately this approach breaks down when the
>> virtqueue carries requests that are organized very differently, but it
>> might be useful when there is a most common request type.
> 
> I'm not sure if I explained this well. What I'm trying to say is that I
> think RDMA benefits when the receiver's memory constraints are visible
> to the sender. The sender performs RDMA WRITEs to the locations where
> the receiver can efficiently process the data.
> 
> This protocol proposal doesn't really take advantage of this approach
> because it communicates the virtqueue buffer elements from the initiator
> (the sender) to the target (the receiver). That's the wrong way around.
> 
> I have never used RDMA myself, so this might be wrong, but as long as
> the RDMA API allows the sender to specify a scatter-gather list as
> input, then I think the details of the virtqueue buffer elements that
> don't have the WRITE flag should never be communicated over the network.
> Instead the initiator should RDMA WRITE from the VIRTIO driver's
> scatter-gather list to the target's preferred destination instead.
> 
> Stefan

Hi,

I guess I followed your point. "the target can offer a list of 
key/length pairs that the initiator RDMA WRITES the virtqueue buffer 
contents into" seems not good to me, I'd prefer to expose RDMA memory 
region of initiator side only(target side uses RDMA READ/WRITE to 
operate the memory of initiator, this means target side has no need to 
allocate/pin memory buffer).

 From the point of my view, this protocol needs to be effective and 
maintainable, mapping vring mechanism with RDMA WRITE from 2 
directions(initiator to target, and target to initiator) leads high 
complexity ...

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-31 14:00   ` [virtio-comment] " Stefan Hajnoczi
@ 2023-06-02  1:17     ` zhenwei pi
  0 siblings, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  1:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 5/31/23 22:00, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:00PM +0800, zhenwei pi wrote:
>> In the past years, virtio supports lots of device specifications by
>> PCI/MMIO/CCW. These devices work fine in the virtualization environment.
>>
>> Introduce Virtio Over Fabrics transport to support "network defined
>> peripheral devices". With this transport, Many Virtio based devices
>> transparently work over fabrics. Note that the balloon device may not
>> make sense. Shared memory regions won't work.
>>
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   content.tex           |  1 +
>>   transport-fabrics.tex | 31 +++++++++++++++++++++++++++++++
>>   2 files changed, 32 insertions(+)
>>   create mode 100644 transport-fabrics.tex
>>
>> diff --git a/content.tex b/content.tex
>> index cff548a..f899c3a 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -582,6 +582,7 @@ \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>>   \input{transport-pci.tex}
>>   \input{transport-mmio.tex}
>>   \input{transport-ccw.tex}
>> +\input{transport-fabrics.tex}
>>   
>>   \chapter{Device Types}\label{sec:Device Types}
>>   
>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>> new file mode 100644
>> index 0000000..0dc031b
>> --- /dev/null
>> +++ b/transport-fabrics.tex
>> @@ -0,0 +1,31 @@
>> +\section{Virtio Over Fabrics}\label{sec:Virtio Transport Options / Virtio Over Fabrics}
>> +
>> +This section defines specification to Virtio that enables operation over other
>> +interconnects. A central goal of Virtio Over Fabrics is to maintain consistency
>> +with the PCI device, so Virtio based devices transparently work over PCI or
>> +fabrics.
> 
> The reader wants to know what VIRTIO Over Fabrics is, not how it relates
> to other Transports that they may not be very familiar with.
> 
> Fabrics is a Transport and any Transport is capable of supporting the
> VIRTIO device model. Therefore I don't think the stated aim should be to
> match PCI specifically. Just being a Transport is already enough. PCI is
> not special.
> 
> I suggest something like:
> 
>    Virtio Over Fabrics enables operation over interconnects that rely
>    primarily on message passing. Supported interconnects include TODO.
> 
>> +
>> +Virtio Over Fabrics uses reliable connection to transmit data, the reliable
> 
> "uses a reliable connection"
> 
>> +connection betweens two rules:
> 
> "connection facilitates communication between entities playing the following roles:"
> 
>> +
>> +\begin{itemize}
>> +\item An initiator functions as an Virtio Over Fabrics client. An initiator
> 
> "as a Virtio ..."
> 
>> +typically serves the same purpose to a machine as a Virtio device, issues
>> +commands to remote side.
> 
> This says that the driver talks to the initiator instead of a local
> device and the initiator forwards commands to the actual device on the
> remote side?
> 
> I find this sentence confusing because I associate the initiator with
> the driver, not the device.
> 
> Maybe:
> 
>    The initiator sends commands from the driver to the target.
> 
>> +\item A target functions as an Virtio Over Fabrics server. An target typically
> 
> "A target"
> 
>> +handles commands from the initiator side and responses completions.
> 
> The concept of the device is missing here. For symmetry it may be good
> to say something like:
> 
>    The target forwards commands to the device and sends responses back to
>    the initiator.
> 
>> +\end{itemize}
>> +
>> +Virtio Over Fabrics has the following differences from the PCI based
>> +specification:
>> +
>> +\begin{itemize}
>> +\item Instead of memory sharing mechanism of virtqueue, there is a one-to-one
>> +mapping between virtqueue and the reliable connection which executes the vring
>> +data transmission.
>> +\item An additional control connection is required to execute control commands
>> +which is similar to read/write register on a PCI device.
>> +\item Virtio Over Fabrics does not define an interrupt mechanism that allows an
>> +initiator to generate a host interrupt. It is the responsibility of the host
>> +fabric interface to generate host interrupts.
>> +\end{itemize}
> 
> As mentioned above, comparing against PCI requires that the reader is
> familiar with PCI. I think it would be preferrable to explain the unique
> characteristics of Virtio Over Fabrics in a self-contained way:
> 
>    The basic organization of Virtio Over Fabrics is as follows:
> 
>    \begin{itemize}
>    \item A reliable connection carries control commands that are not specific to a virtqueue.
>    \item Each virtqueue has its own reliable connection.
>    \item There is no interrupt mechanism since the arrival of data on the fabric already indicates when there is activity.
>    \end{itemize}
> 
> Stefan

I'll drop comparing against PCI part, and fix other parts in the next 
version. Thanks!

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name
  2023-05-31 14:06   ` Stefan Hajnoczi
@ 2023-06-02  1:50     ` zhenwei pi
  2023-06-05  2:40       ` Parav Pandit
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  1:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 5/31/23 22:06, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:01PM +0800, zhenwei pi wrote:
>> Add VQN section. The VQN is a little different from iSCSI/NVMe-oF on
>> style limitation. Because iSCSI/NVMe-of is storage specific protocol,
>> the full string IQN(for iSCSI/iSER) and NQN(for NVMe-oF) represents
>> a "storage access address". However, Virtio Over Fabrics works as
>> transport layer rather than device layer, a URL style string is better
>> to Virtio Over Fabrics. For example:
>> virtio-of://blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
>> virtio-of://blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
>> ...
>> virtio-of://crypto-resource/25307f22-e5a8-4ea2-b7ca-79f5c3bebc3c
> 
> I'm not sure what blk-resource and nvme-pool are in these URLs?
> 
> Should the patch mention the virtio-of:// URI scheme?
> 

Sorry, I missed the address and port. They should be:
virtio-rdma://192.168.1.100:8549/blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
virtio-tcp://192.168.1.110/blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
...

This is human readable string. when the software(or hardware) handles 
this, this should be translated into:
transport: RDMA
address: 192.168.1.100
port: 8549 (default port 8549(CRC-16/ARC of "Virtio"))
target VQN: blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1

This section only defines the "VQN" schema, not the resource string schema.

For a process, I think the following two are both fine:
./foo --full-url 
virtio-rdma://192.168.1.100:8549/blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
./foo --transport rdma --address 192.168.1.100 --port 8549 --tvqn
blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1

[snip]

> 
> Is the maximum name 255 UTF-8 bytes plus a NUL character? Please state
> this in the spec. For example:
> 
>    \item The string is NUL terminated.
>    \item The maximum name is 256 bytes in length, including the NUL character.
> 
OK, fix this in the next version.

>> +\item There is no strict style limitation.
> 
> I think it's necessary to define representations for specific fabrics
> (e.g. TCP/IP) so that VQNs can be exchanged between different VIRTIO
> implementations (VMMs, DPUs, command-line utilities, etc). Otherwise two
> different implementations may represent the same address differently.
> 
> Stefan

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission
  2023-05-31 15:20   ` Stefan Hajnoczi
@ 2023-06-02  2:26     ` zhenwei pi
  2023-06-05 16:11       ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  2:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 5/31/23 23:20, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:03PM +0800, zhenwei pi wrote:
>> Stream transmission is used for stream oriented communication(Ex TCP),
>> also add virtio-blk read/write 8K example.
>>
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   transport-fabrics.tex | 229 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 229 insertions(+)
>>
>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>> index b88acfd..c02cf26 100644
>> --- a/transport-fabrics.tex
>> +++ b/transport-fabrics.tex
>> @@ -88,3 +88,232 @@ \subsubsection{Segment Descriptor Definition}\label{sec:Virtio Transport Options
>>   \end{tabular}
>>   
>>   Depending on the opcode, a Command contains zero or more structure virtio_of_vring_desc.
>> +
>> +\subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Buffer Mapping Definition}
>> +Virtio Over Fabrics defines two types of buffer mapping rules.
> 
> What is a buffer? Is it a virtqueue buffer (consisting of one or more
> descriptors/elements) or are you using the term for a different concept?
> 

I'll use 'descriptor' to describe this only in the next version.

>> +
>> +\paragraph{Stream Transmission}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
>> +Command, Segment Descriptors, and buffer are transmitted in a stream within a
> 
> Is a Segment Descriptor a virtio_of_vring_desc?
> 
>> +connection. The layout in stream:
>> +
>> +\begin{lstlisting}
>> +CMDx contains 0 descriptor, CMDy contains (n - m + 1) descriptors and buffer:
> 
> "0 descriptors"
> 

OK.
>> +
>> +     +-----+     +-----++-----+     +-----++-----+
>> + ... | CMDx| ... | CMDy||DESCm| ... |DESCn|| BUF | ...
>> +     +-----+     +-----++-----+     +-----++-----+
>> +
>> +COMPx contains 0 descriptor, COMPy contains (k - j + 1) descriptors and buffer:
> 
> I think this is the first time the concept of a completion (COMP) was
> introduced. Please describe commands/completions before using them in
> the text.
> 

OK.
>> +
>> +     +-----+     +-----++-----+     +-----++-----+
>> + ... |COMPx| ... |COMPy||DESCj| ... |DESCk|| BUF | ...
>> +     +-----+     +-----++-----+     +-----++-----+
>> +\end{lstlisting}
>> +
>> +An example of a virtio-blk write 8K request(total size: sizeof(Command) +
>> +4 * sizeof(Descriptor) + 8208):
>> +\begin{lstlisting}
>> + COMMAND            +------+
>> +                    |opcode|  ->  virtio_of_op_vring
>> +                    +------+
>> +                    |cmd id|  ->  10
>> +                    +------+
>> +                    |length|  ->  8208
>> +                    +------+
>> +                    |ndesc |  ->  4
>> +                    +------+
>> +                    |rsvd  |
>> +                    +------+
>> +
>> + DESC0              +------+
>> +              +-----|addr  |  -> 0
>> +              |     +------+
>> +              |     |length|  -> 16 (virtio blk write command)
>> +              |     +------+
>> +              |     |id    |  -> 0
>> +              |     +------+
>> +              |     |flags |  -> 0
>> +              |     +------+
>> +              |
>> + DESC1        |     +------+
>> +              | +---|addr  |  -> 16
>> +              | |   +------+
>> +              | |   |length|  -> 4096
>> +              | |   +------+
>> +              | |   |id    |  -> 1
>> +              | |   +------+
>> +              | |   |flags |  -> 0
>> +              | |   +------+
>> +              | |
>> + DESC2        | |   +------+
>> +              | | +-|addr  |  -> 4112
>> +              | | | +------+
>> +              | | | |length|  -> 4096
>> +              | | | +------+
>> +              | | | |id    |  -> 2
>> +              | | | +------+
>> +              | | | |flags |  -> 0
>> +              | | | +------+
>> +              | | |
>> + DESC3        | | | +------+
>> +              | | | |addr  |  -> 0
> 
> Is this field 0 in all stream connection VIRTIO_OF_DESC_F_WRITE
> descriptors?
> 

Yes. I missed the comment of 'addr' field in the '[PATCH v2 03/11] 
transport-fabircs: introduce Segment Descriptor Definition.

When the flags has VIRTIO_OF_DESC_F_KEYED, the 'addr' means the remote 
address. otherwise the 'addr' means the offset in the stream buffer. 
Because VIRTIO_OF_DESC_F_WRITE is a read descriptor, there is no payload 
in the command, the 'addr' of a read descriptor always 0.

>> +              | | | +------+
>> +              | | | |length|  -> 1
>> +              | | | +------+
>> +              | | | |id    |  -> 3
>> +              | | | +------+
>> +              | | | |flags |  -> VIRTIO_OF_DESC_F_WRITE
>> +              | | | +------+
>> +              | | |
>> + DATA         +-+-+>+------+  -> 0
>> +                | | |......|
>> +                +-+>+------+  -> 16
>> +                  | |......|
>> +                  +>+------+  -> 4112
>> +                    |......|
>> +                    +------+  -> 8208
>> +\end{lstlisting}
>> +
>> +The Completion of this request(total size: sizeof(Completion) +
>> +1 * sizeof(Descriptor) + 1):
>> +\begin{lstlisting}
>> + COMPLETION         +------+
>> +                    |status|  ->  VIRTIO_OF_SUCCESS
>> +                    +------+
>> +                    |cmd id|  ->  10
>> +                    +------+
>> +                    |ndesc |  ->  1
>> +                    +------+
>> +                    |rsvd  |
>> +                    +------+
>> +                    |value |  -> 1 (value.u32)
> 
> What is this field and what does u32 mean?
> 

Same to the virtq_used_elem::len 
(https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-540008).
I need reorder the patch, move completion definition before using.

>> +                    +------+
>> +
>> + DESC0              +------+
>> +                  +-|addr  |  -> 0
>> +                  | +------+
>> +                  | |length|  -> 1
>> +                  | +------+
>> +                  | |id    |  -> 3
> 
> This has to match with the original descriptor id sent with the Command?
> 

Yes.

>> +                  | +------+
>> +                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
>> +                  | +------+
>> +                  |
>> + DATA             |>+------+  -> 0
>> +                    |......|
>> +                    +------+  -> 1
>> +\end{lstlisting}
> 
> I think this is more flexible (and has more protocol overhead) than
> necessary. When the device has used a virtqueue buffer, it indicates how
> many bytes were used (this can be less than the totaly number of F_WRITE
> bytes available). I don't think there is a need to communicate F_WRITE
> descriptors, especially in the Completion. Just a Completion with a
> 'length' field instead of an 'ndesc' field followed by data is enough.
> 

I guest this is not enough. For example, a initiator want to read 3 
desc: desc0[n bytes], desc1[m bytes], desc2[1 byte]. desc[2] is expected 
to read a u8 status.

the target fills desc0[n - x bytes], desc1[m - y bytes], desc2[1 byte], 
the 'length' is (n - x + m - y + 1), we should decode each descriptor 
and fill the driver buffer correctly.(otherwise, if x + y > 0, desc[2] 
is never filled)

> Since VIRTIO has flexible framing
> (https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-390004),
> there isn't really a need to communicate the F_WRITE descriptors at all,
> just the maximum number of used bytes that the initiator expects.
> 
> Can you explain why you chose to transmit F_WRITE descriptors in both
> the Command and the Completion? Maybe I missed a reason why it's
> important.

Just keep the flags same to the descriptor from the command, give the 
initiator a hint 'this is a read descriptor'.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition
  2023-05-31 14:23   ` Stefan Hajnoczi
@ 2023-06-02  3:08     ` zhenwei pi
  0 siblings, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  3:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 5/31/23 22:23, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:02PM +0800, zhenwei pi wrote:
>> Introduce segment descriptor to describe the Virtio device buffer
>> segments.
>>
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   transport-fabrics.tex | 43 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 43 insertions(+)
>>
>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>> index 26b0192..b88acfd 100644
>> --- a/transport-fabrics.tex
>> +++ b/transport-fabrics.tex
>> @@ -45,3 +45,46 @@ \subsection{Virtio Qualified Name}\label{sec:Virtio Transport Options / Virtio O
>>   \item The string is null terminated.
>>   \item There is no strict style limitation.
>>   \end{itemize}
>> +
>> +\subsection{Transmission Protocol}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol}
>> +This section defines transmission protocol for Virtio Over Fabrics. All the
> 
> What does "transmission protocol" mean? I guess this is what is often
> called a network protocol or a wire protocol or just a protocol, but it
> wasn't clear to me maybe whether the "transmission protocol" is one
> protocol out of a set of protocols that make up Virtio Over Fabrics.
> 
> This paragraph should describe which connections use this protocol. For
> example:
> 
>    This protocol is used for both control and virtqueue connections.
> 
>> +fields use little endian format.
>> +
>> +\subsubsection{Segment Descriptor Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Segment Descriptor Definition}
>> +Virtio Over Fabrics uses the following structure to describe data segment:
> 
> What is a data segment? I guess it's a message/command/request?
> 
> There should be an explanation of how data segments are used. For
> example:
> 
>    The initiator sends a data segment containing the command to the
>    target. The target sends a data segment containing the response to the
>    command back to the initiator.
> 
>> +
>> +\begin{lstlisting}
>> +struct virtio_of_vring_desc {
> 
> I think the name "vring" should be avoided. The vring is an in-memory
> layout for implementing virtqueues where shared memory is available.
> Calling it virtio_of_vq_desc makes it clear that Virtio Over Fabrics
> does not use vrings to implement virtqueues.
> 
>> +        le64 addr;
>> +        le32 length;
>> +        /* This marks the unique ID within a command, no limitation among inflight commands */
> 
> What is a command?
> 
>> +        le16 id;
>> +        /* This marks a buffer as keyed transmission (otherwise stream transmission) */
>> +#define VIRTIO_OF_DESC_F_KEYED     1
>> +        /* This marks a buffer as device write-only (otherwise device read-only). */
>> +#define VIRTIO_OF_DESC_F_WRITE     2
>> +        le16 flags;
>> +        le32 key;
>> +};
>> +\end{lstlisting}
>> +
>> +The structure virtio_of_vring_desc is used for both keyed transmission
>> +(i.e. RDMA) and stream transmission(i.e. TCP). The fields is described as follows:
>> +
>> +\begin{tabular}{ |l|l|l| }
>> +\hline
>> +Field & keyed transmission & stream transmission \\
>> +\hline \hline
>> +addr & Start address of remote memory buffer & Start address within the stream buffer \\
> 
> What is a stream buffer?
> 
>> +\hline
>> +length & The length of remote memory buffer & The length of buffer within the stream \\
> 
> I'm not sure what buffer means here. I guess it's not the same as a
> virtqueue buffer, it's probably a virtqueue descriptor (element)?
> 
> Can you avoid using buffer here since it usually means something else in
> Virtio?
> 

OK.

>> +\hline
>> +id & The ID of this descriptor & The ID of this descriptor \\
>> +\hline
>> +flags & both keyed transmission and stream transmission supported & stream transmission only \\
> 
> I'm not sure what this means.
> 
>> +\hline
>> +key & Key of the remote Memory Region & Ignore \\
> 
> Should "Ignore" be "Reserved" so that stream transmission can use this
> field for something else in the future?
> 

OK

>> +\hline
>> +\end{tabular}
>> +
>> +Depending on the opcode, a Command contains zero or more structure virtio_of_vring_desc.
> 
> opcode hasn't been defined yet. I guess that's because the first
> virtio_of_vring_desc contains a Command and that has an opcode field?
> Please make sure the text is ordered so that terms are defined before
> they are used.
> 
> Stefan

OK, I think reordering the text is needed in the next version.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-05-31 17:10   ` [virtio-comment] " Stefan Hajnoczi
@ 2023-06-02  5:15     ` zhenwei pi
  2023-06-05 16:30       ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  5:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/1/23 01:10, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:05PM +0800, zhenwei pi wrote:
>> Introduce command structures for Virtio-oF.
>>
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   transport-fabrics.tex | 209 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 209 insertions(+)
>>
>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>> index 7711321..37f57c6 100644
>> --- a/transport-fabrics.tex
>> +++ b/transport-fabrics.tex
>> @@ -495,3 +495,212 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
>>                       |value |  -> 8193 (value.u32)
>>                       +------+
>>   \end{lstlisting}
>> +
>> +\subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
>> +This section defines command structures for Virtio Over Fabrics.
>> +
>> +A common structure virtio_of_value is fixed to 8 bytes and MUST be used as one
>> +of the following format:
>> +
>> +\begin{itemize}
>> +\item u8
>> +\item le16
>> +\item le32
>> +\item le64
>> +\end{itemize}
> 
> The way it's written does not document where the u8, u16, u32 bytes are
> located and that the unused bytes are 0. I think I understand what you
> mean though:
> 
>    le64 value = cpu_to_le64((u64)v); /* v is u8, u16, u32, or u64 */
> 
> Please clarify.
> 

I want to describe an union structure of 8 bytes:
union virtio_of_value {
     u8;
     u16;
     u32;
     u64;
};

Depending on the opcode, use the right one.

>> +
>> +\paragraph{Command ID}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Command ID}
>> +There is command_id(le16) field in each Command and Completion:
> 
> "is a command_id"
> 

OK.

>> +
>> +\begin{itemize}
>> +\item Generally the initiator allocates a Command ID and specifies the
> 
> "allocates a Command ID that is unique for all in-flight commands"?
> 

Yes. Will add.

>> +command_id field of a Command, and the target MUST specify the same Command ID
> 
> The "MUST" statement needs to be in a driver-normative section. You can
> keep the sentence in this non-normative section by tweaking it:
> "target specifies"
> 
> The idea is that all MUST/SHOULD/etc statements are in a separate
> device/driver-normative section so that they can be easily reviewed by
> device/driver implementers without re-reading the entire text.
> 

OK.

>> +in command_id field of Completion.
>> +\item The initiator MUST guarantee each Command ID is unique in the inflight Commands.
> 
> Same here about "MUST".
> 
>> +\item Command ID 0xff00 - 0xffff is reserved for control queue to delivery asynchronous event.
> 
> "for control queue asynchronous events"
> 

OK.

>> +\end{itemize}
>> +
>> +The reserved Command ID for control queue is defined as follows:
> 
> "The reserved Command IDs for the control queue are as follows:"
> 
>> +
>> +\begin{tabular}{ |l|l| }
>> +\hline
>> +Command ID & Description \\
>> +\hline \hline
>> +0xffff & Keepalive. The initiator SHOULD ignore this event \\
> 
> "Ignored by the initiator." + move the SHOULD statement to a
> driver-normative section.
> 
>> +\hline
>> +0xfffe & Config change. The initiator SHOULD generate config change interrupt to device \\
> 
> "Causes the initiator to generate a configuration change notification."
> 
>> +\hline
>> +0xff00 - 0xfffd & Reserved \\
>> +\hline
>> +\end{tabular}
>> +
>> +\paragraph{Connect Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
>> +The Connect Command is used to establish Virtio Over Fabrics queue. The control
>> +queue MUST be established firstly, then the Connect command establishes an
>> +association between the initiator and the target.
> 
> Is a "Virtio Over Fabrics queue" different from a virtqueue?
> 
> If I understand correctly, the control queue must be established by the
> initiator first and then the Connect command is sent to begin
> communication between the initiator and the target?
> 

The queue mapping is missing in the '[PATCH v2 01/11] transport-fabrics: 
introduce Virtio Over Fabrics overview', like:
A "Virtio Over Fabrics queue" is a reliable connection between initiator 
and target. There are 2 types of Virtio Over Fabrics queue:
+\begin{itemize}
+\item A single Control queue is required to execute control operations.
+\item 0 or more Virtio Over Fabrics queues map the virtqueues.
+\end{itemize}

>> +
>> +The Target ID of 0xffff is reserved, then:
> 
> Please move this after the fields have been shown and the purpose of the
> Target ID field has been explained.
> 
>> +\begin{itemize}
>> +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
>> +Command for the control queue.
>> +\item The target SHOULD allocate any available Target ID to the initiator,
>> +and return the allocated Target ID in the Completion.
>> +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
>> +MUST be specified in a Connect Command for the virtqueue.
>> +\end{itemize}
> 
> What is the purpose of the Target ID? Is it to allow a server to provide
> access to multiple targets over the same connection?
> 

A target listens on a port, and provides access to 0 or more targets. An 
initiator connect the specific target by TVQN of connect command.
An initiator could connect a single target, multiple initiators could 
connect the same target(typically, shared disk/fs).

>> +
>> +The Connect Command has following structure:
>> +
>> +\begin{lstlisting}
>> +struct virtio_of_command_connect {
>> +        le16 opcode;
>> +        le16 command_id;
>> +        le16 target_id;
>> +        le16 queue_id;
>> +        le16 ndesc;
> 
> Where is this field documented?
> 

OK. Will add.

> Why does the initiator send ndesc to the target? Normally a VIRTIO Transport reports the device's max descriptors and then the driver can tell the device to reduce the number of descriptors, if desired.
> 

A target supports at lease 1 descriptor. The 'ndesc' of struct 
virtio_of_command_connect indicates the full PDU contains: struct 
virtio_of_command_connect + 1 * virtio_of_vq_desc + data.

>> +#define VIRTIO_OF_CONNECTION_TCP     1
>> +#define VIRTIO_OF_CONNECTION_RDMA    2
> 
> What does RDMA mean? I thought RDMA is a general concept that several
> fabrics implement (with different details like how addressing works).
> 

I guest your concern is the difference of IB/RoCE/iWarp ...
We are trying to define the payload protocol here, so I think we can 
ignore the difference of the HCA.

>> +        u8 oftype;
>> +        u8 padding[5];
>> +};
>> +\end{lstlisting}
>> +
>> +The Connect commands MUST contains one Segment Descriptor and one structure
>> +virtio_of_command_connect to specify Initiator VQN and Target VNQ,
>> +virtio_of_command_connect has following structure:
> 
> I'm confsued. virtio_of_command_connect was defined above. The struct
> defined below is virtio_of_connect. Does this paragraph need to be
> updated (virtio_of_command_connect -> virtio_of_connect)?
> 
> Why is virtio_of_connect a separate struct and not part of
> virtio_of_command_connect?
> 

Because I'd like to define all the commands with a fixed length.

>> +
>> +\begin{lstlisting}
>> +struct virtio_of_connect {
>> +        u8 ivqn[256];
>> +        u8 tvqn[256];
> 
> If the initiator is already sends tvqn, why also have target_id?
> 
>> +        u8 padding[512];
>> +};
>> +\end{lstlisting}
>> +
>> +\paragraph{Feature Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command}
>> +
>> +The control queue uses Feature Command to get or set features. This command is used for:
>> +
>> +\begin{itemize}
>> +\item The initiator/target features. This is used to negotiate transport layer features.
>> +\item The driver/device features. This is used to negotiate Virtio Based device
>> +features which is similar to PCI based device.
> 
> Please do not make references to the PCI Transport.
> 

OK.

>> +\end{itemize}
>> +
>> +The Feature Command has following structure:
>> +
>> +\begin{lstlisting}
>> +struct virtio_of_command_feature {
>> +        le16 opcode;
>> +        le16 command_id;
>> +        le32 feature_select;
>> +        le64 value;        /* ignore this field on GET */
>> +};
>> +\end{lstlisting}
> 
> I guess the opcode tells the target whether this is a VIRTIO Features
> Get, VIRTIO Features Set, VIRTIO-Over-Fabrics Features Get, or
> VIRTIO-Over-Fabrics Features Set command? Please document the opcodes
> here and also include a full opcode table somewhere else.
> 
>> +
>> +\paragraph{Queue Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Queue Command}
>> +
>> +The control queue uses Queue Command to get or set properties on a specific queue.
>> +The Queue Command has following structure:
>> +
>> +\begin{lstlisting}
>> +struct virtio_of_command_queue {
>> +        le16 opcode;
>> +        le16 command_id;
>> +        le16 queue_id;
> 
> Does "queue" mean virtqueue here? Or does it also apply to the control
> queue? If it's a virtqueue, please call this vq_id.
> 
>> +        u8 padding6;
>> +        u8 padding7;
>> +        struct virtio_of_value value;   /* ignore this field on GET */
>> +};
>> +\end{lstlisting}
> 
> The opcode and their semantics are not documented.
> 
>> +\paragraph{Config Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Config Command}
>> +
>> +The control queue uses Config Command to get or set configure on device.
>> +The Config Command has following structure:
> 
> I suggest choosing a different name to avoid confusion with the
> VIRTIO Configuration Space.
> 
>> +
>> +\begin{lstlisting}
>> +struct virtio_of_command_config {
>> +        le16 opcode;
>> +        le16 command_id;
>> +        le16 offset;
>> +        u8 bytes;
>> +        u8 padding7;
>> +        struct virtio_of_value value;        /* ignore this field on GET */
>> +};
>> +\end{lstlisting}
>> +
>> +The bytes field supports on Get only:
>> +
>> +\begin{itemize}
>> +\item 1, then the initiator reads from value field of Completion as u8
>> +\item 2, then the initiator reads from value field of Completion as le16
>> +\item 4, then the initiator reads from value field of Completion as le32
>> +\item 8, then the initiator reads from value field of Completion as le64
>> +\end{itemize}
>> +
>> +The bytes field supports on Set only:
>> +
>> +\begin{itemize}
>> +\item 1, then the initiator specifies the value field of Config Command as u8
>> +\item 2, then the initiator specifies the value field of Config Command as le16
>> +\item 4, then the initiator specifies the value field of Config Command as le32
>> +\item 8, then the initiator specifies the value field of Config Command as le64
>> +\end{itemize}
> 
> I have no idea what virtio_of_command_config does because the opcodes
> aren't documented.
> 
>> +
>> +\paragraph{Common Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command}
>> +
>> +The control queue uses Common Command to get or set common properties on
>> +device(i.e. get device ID). The Common Command has following structure:
>> +
>> +\begin{lstlisting}
>> +struct virtio_of_command_common {
>> +        le16 opcode;
>> +        le16 command_id;
>> +        u8 padding4;
>> +        u8 padding5;
>> +        u8 padding6;
>> +        u8 padding7;
>> +        struct virtio_of_value value;        /* ignore this field on GET */
>> +};
>> +\end{lstlisting}
>> +
>> +
>> +\paragraph{Vring Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Vring Command}
>> +
>> +Both control queue and virtqueue use Vring Command to transmit buffer.
>> +The Vring Command has following structure:
>> +
>> +\begin{lstlisting}
>> +struct virtio_of_command_vring {
>> +        le16 opcode;
>> +        le16 command_id;
>> +        /* Total buffer size this command contains(not include command&descriptors). */
>> +        le32 length;
>> +        /* How many descriptors this command contains */
>> +        le16 ndesc;
>> +        u8 padding[6];
>> +};
>> +\end{lstlisting}
>> +
>> +\paragraph{Completion}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Completion}
>> +
>> +The target responses Completion to the initiator to report command status,
>> +device properties, and transmit buffer. The Completion has following structure:
>> +
>> +\begin{lstlisting}
>> +struct virtio_of_completion {
>> +        le16 status;
>> +        le16 command_id;
>> +        /* How many descriptors this completion contains */
>> +        le16 ndesc;
>> +        u8 rsvd6;
>> +        u8 rsvd7;
>> +        struct virtio_of_value value;
>> +};
>> +\end{lstlisting}
>> +
>> +Note that Virtio Over Fabrics does not define an interrupt mechanism, generally
>> +the initiator receives a Completion, it SHOULD generate a host interrupt
>> +(if no interrupt suspending on device).
> 
> It's not possible to review this patch because these structs aren't used
> yet and the opcodes are undefined.
> 
> Defining structs that are shared by multiple opcodes might make
> implementations cleaner, but I think it makes the spec less clear. I
> would rather have a list of all opcodes and each one shows the full
> command layout (even if it is duplicated). That way it's very easy to
> look up an opcode you are implementing or debugging and check what's
> needed. If the command layout is not documented in a single place, then
> it takes more effort to figure out how an opcode works.
> 
> Stefan

OK, I'll merge the structure definition into the opcode definition.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: [PATCH v2 07/11] transport-fabrics: introduce opcodes
       [not found]   ` <20230531205508.GA1509630@fedora>
@ 2023-06-02  8:39     ` zhenwei pi
  2023-06-05 16:46       ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  8:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/1/23 04:55, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:06PM +0800, zhenwei pi wrote:
>> Define opcode with this rule:
>> The Virtio-oF transport layer commands use 0x0000-0x0fff, and the
>> device layer commands use 0x1000-0xffff.
>> get/set status/feature/
>> config use consecutive number.
>>
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   transport-fabrics.tex | 134 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 134 insertions(+)
>>
>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>> index 37f57c6..026ff5f 100644
>> --- a/transport-fabrics.tex
>> +++ b/transport-fabrics.tex
>> @@ -704,3 +704,137 @@ \subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio
>>   Note that Virtio Over Fabrics does not define an interrupt mechanism, generally
>>   the initiator receives a Completion, it SHOULD generate a host interrupt
>>   (if no interrupt suspending on device).
>> +
>> +\subsubsection{Opcodes Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition}
>> +This section defines command opcodes for Virtio Over Fabrics:
>> +
>> +\begin{lstlisting}
>> +#define virtio_of_op_connect               0x0000
>> +#define virtio_of_op_discconnect           0x0001
> 
> "disconnect"
> 

OK.

>> +#define virtio_of_op_get_feature           0x0002
>> +#define virtio_of_op_set_feature           0x0003
>> +#define virtio_of_op_keepalive             0x0004
>> +#define virtio_of_op_vring                 0x0fff
>> +#define virtio_of_op_get_vendor_id         0x1000
>> +#define virtio_of_op_get_device_id         0x1001
>> +#define virtio_of_op_get_generation        0x1002
>> +#define virtio_of_op_get_status            0x1004
>> +#define virtio_of_op_set_status            0x1005
>> +#define virtio_of_op_get_device_feature    0x1006
>> +#define virtio_of_op_set_driver_feature    0x1007
>> +#define virtio_of_op_get_num_queues        0x1008
>> +#define virtio_of_op_get_queue_size        0x100a
>> +#define virtio_of_op_get_config            0x100c
>> +#define virtio_of_op_set_config            0x100d
>> +\end{lstlisting}
> 
> Is Queue Reset missing?
> https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-280001
> 

Originally, I designed reset as set_status(0). But an explicit reset 
command is better! Add this in next version.

>> +
>> +\paragraph{virtio_of_op_connect}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_connect}
>> +
>> +virtio_of_op_connect is used to connect a target for both control queue and virtqueue.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
>> +and specify the ndesc field as 1, also contains 1 structure virtio_of_vring_desc
>> +filled by structure virtio_of_command_status.
> 
> What are the semantics of this command? Is the idea that the initiatior
> will establish 1 TCP connection to the target for every virtqueue (plus
> one for the control queue) and send a virtio_of_op_connect command as
> the first command in the connection in order to indicate that the
> connection is associated with a specific queue?
>

Yes.

> Is there a state machine related to connection and queue lifecycles?
> 

No state machine related to connection. The lifecycles of Virtio-oF 
queues: establish a connection(transport specific), issue connect 
command, issue control/vq command ... issue disconnect command(optional, 
gracefully shutdown), disconnection connection.

>> +
>> +\paragraph{virtio_of_op_discconnect}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_discconnect}
>> +
>> +virtio_of_op_discconnect is used to disconnect a target for both control queue and virtqueue.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command}.
> 
> What happens if the initiatior drops the connection without sending a
> virtio_of_op_discconnect command?
> 

A Virtio-of queue could shutdown gracefully by issuing 
virtio_of_op_discconnect command. The inflight commands will be 
completed by target.
Otherwise the inflight commands may be dropped by target.

> Are there resources associated with a connected queue?
> 

A control queue disconnects, the initiator and target should shutdown 
all the virtqueues associated with this control queue.

A connected virtqueue associates a QID for both initiator and target 
side, once a Virtio-of queue disconnects, the QID becomes free, allow 
reconnect.

>> +
>> +\paragraph{virtio_of_op_get_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_feature}
>> +
>> +virtio_of_op_get_feature is used to get features of target for control queue only.
> 
> Does this command fail when sent to a virtqueue instead of the control
> queue?
> 
> By the way, what's the difference between a connection and queue?
> 

I need describe the mapping between a connection and queue in '[PATCH v2 
01/11] transport-fabrics: introduce Virtio Over Fabrics overview', and 
use 'Virtio Over Fabrics queue' only.

>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command}.
> 
> When? As the first message or immediately after virtio_of_op_connect?
> 

At any time theory. As the first message or immediately after 
virtio_of_op_connect is recommended.

>> +
>> +\begin{tabular}{ |l|l|l| }
>> +\hline
>> +Feature Select & Value & Description \\
>> +\hline
>> +virtio_of_feature_max_segment & 0x0 & Get the maximum segments within a Vring Command supported by target \\
> 
> Does Vring Command mean virtio_of_op_vring? What is the difference
> between "segments" and "descriptors"?
> 
> Does the max_segment value have a type or does the initiator have to
> support up to u64?
> 
> How does max_segment affect the maximum number of virtqueue buffer
> elements?
> 
> What happens when a feature that is not supported by the target is
> queried by the initiator?
> 
> I was expecting a feature bit negotiation mechanism, but it seems the
> "feature" is a parameter value, not just a single but like VIRTIO
> Feature Bits. Please rename this to "parameter", "setting", or similar
> to avoid confusion with Feature Bits.
> 

VIRTIO Feature Bits style is fine. Change into this style in next 
version. Then I'd introduce a new opcode 
virtio_of_op_get_max_segment(required command, no feature bits testing).

>> +\hline
>> +\end{tabular}
>> +
>> +\paragraph{virtio_of_op_set_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_feature}
>> +
>> +virtio_of_op_set_feature is used to set features of initiator for control queue only.
> 
> "set features of initiator" sounds like the target uses it to set up the
> initiator, but I think this command is sent from the initiator to the
> target. Maybe:
> 
>    "virtio_of_op_set_feature sets feature values in the target and is
>    sent on the control queue."
> 

OK.

>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command}.
> 
> There are currently no features defined that can be set using
> virtio_of_op_set_feature?
> 

'[PATCH v2 11/11] transport-fabrics: support inline data for keyed 
transmission' would be the first bit.

>> +\paragraph{virtio_of_op_keepalive}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_keepalive}
>> +
>> +virtio_of_op_keepalive is used to keep alive with the target for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command}.
> 
> What is the purpose of this command? It is only sent on the control
> queue so its purpose cannot be to actually keep connections alive since
> virtqueue connections would not stay alive.
> 
> Maybe this is really a health check (ping/pong) to detect when the
> control queue becomes unavailable?
> 

Originally I thought that keepalive for control is enough: once the 
control becomes unavailable, shutdown the control queue and all the 
related virtqueues. the virtio_of_op_vring commands detect the virtqueue 
connection implicitly.

This command for both control queue and virtqueue is fine.

>> +\paragraph{}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_vring}
>> +
>> +virtio_of_op_vring is used to transmit buffer for both control queue and virtqueue.
> 
> "buffers"
> 
> My understanding is that the control queue is not a virtqueue. How can a
> vring operation make sense on something that is not a virtqueue?
> 

Oh, my fault. this is for virtqueue only. and virtio_of_op_vring should 
be renamed to virtio_of_op_vq.

> I think the term used by the VIRTIO spec (2.6 Virtqueues) is "supply an
> available buffer to the device" rather than "transmit buffer".
> 

OK.

>> +The initiator MUST issues \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Vring Command}
> 
> "issue"
> 
>> +and specify the ndesc field as the number of buffer segments,
> 
> buffer segments == data segments == segment descriptor (struct virtio_of_vring_desc)?
> 
> Please pick one term and use it consistently.
> 

OK.

>> +also contains ndesc structure virtio_of_vring_desc.
>> +Each structure virtio_of_vring_desc is filled by each buffer segment one by one.
>> +
>> +\paragraph{virtio_of_op_get_vendor_id}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_vendor_id}
>> +
>> +virtio_of_op_get_vendor_id is used to get vendor id for control queue only.
> 
> The spec uses slightly more specific terms that avoid confusion with
> other types of device/vendor IDs: either "get the Virtio Vendor ID" or
> "get the Virtio Subsystem Vendor ID".
> 

OK.

>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
>> +and reads from value field of Completion as le32.
>> +
>> +\paragraph{virtio_of_op_get_device_id}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_device_id}
>> +
>> +virtio_of_op_get_device_id is used to get device id for control queue only.
> 
> "get the Virtio Device ID" or "get the Virtio Subsystem Device ID"
> 

OK.

>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
>> +and reads from value field of Completion as le32.
>> +
>> +\paragraph{virtio_of_op_get_generation}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_generation}
>> +
>> +virtio_of_op_get_generation is used to get config generation for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
>> +and reads from value field of Completion as le32.
> 
> Maybe every virtio_of_op_get_config completion should include the
> generation counter value. That way fewer roundtrips are required because
> virtio_of_op_get_generation commands are not necessary.
> 
> The advantage to virtio_of_op_get_generation is that it maps nicely to
> existing VIRTIO driver frameworks that expect to read the generation
> counter separately, so I guess it's okay to keep it even if it's
> inefficient over a fabric.
> 
> We could do both, too.
> 

Great! I'd define a new structure for 
virtio_of_op_get_generation(include generation field) and drop 
virtio_of_op_get_generation.

>> +
>> +\paragraph{virtio_of_op_get_status}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_status}
>> +
>> +virtio_of_op_get_status is used to get device status for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
>> +and reads from value field of Completion as le32.
>> +
>> +\paragraph{virtio_of_op_set_status}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_status}
>> +
>> +virtio_of_op_set_status is used to set device status for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
>> +and specify the value field of Common Command as le32.
>> +
>> +\paragraph{virtio_of_op_get_device_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_device_feature}
>> +
>> +virtio_of_op_get_device_feature is used to get device feature for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command},
>> +and reads from value field of Completion as le64.
> 
> What happens when feature_select is out of range? I guess
> Completion.value is set to 0.
> 

Yes. But I think the feature_select is always in range, bit64-bit 
127(feature_select == 1) is not offered currently, so 
Completion.value.u64 is 0.

> Does virtio_of_op_get_device_feature return the feature bits offered by
> the device or does it update to reflect negotiated feature bits after
> virtio_of_op_set_driver_feature?
> 

virtio_of_op_get_device_feature returns the same feature bits after 
virtio_of_op_set_driver_feature. Because 1) the device feature is 
capability of device, 2) a target may be shared by multi initiators.

For now, I don't see any dependence on getting driver feature. Do you 
have any concern about this?

>> +
>> +\paragraph{virtio_of_op_set_driver_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_driver_feature}
>> +
>> +virtio_of_op_set_driver_feature is used to set driver feature for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command},
>> +and specify the value field of Common Command as le64.
>> +
>> +The initiator uses feature_select field to select which feature bits to set.
>> +Value 0x0 selects Feature Bits 0 to 63, 0x1 selects Feature Bits 64 to 128, etc.
>> +
>> +\paragraph{virtio_of_op_get_num_queues}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_num_queues}
>> +
>> +virtio_of_op_get_num_queues is used to get the number of queues for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
>> +and reads from value field of Completion as le16.
>> +
>> +\paragraph{virtio_of_op_get_queue_size}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_queue_size}
>> +
>> +virtio_of_op_get_queue_size is used to get the size of a specified queue for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Queue Command} with specified queue_id,
>> +and reads from value field of Completion as le16.
> 
> Is it possible to set the queue size? For example, the PCI Transport
> allows the driver to lower the queue size but not increase it (see
> 4.1.5.1.3 Virtqueue Configuration).
> 

Agree. Because a target may be shared by multi initiators, it's not 
reasonable to set queue size of target, the queue size only affect this 
initiator itself.
For example, a target supports queue size 1024. initiatorX uses 128 
queue size, and initiatorY uses 1024. Do you have any suggestion about this?

>> +
>> +\paragraph{virtio_of_op_get_config}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_config}
>> +
>> +virtio_of_op_get_config is used to get the config of a device for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Config Command} with specified offset and bytes,
>> +and reads from value field of Completion.
>> +
>> +\paragraph{virtio_of_op_set_config}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_config}
>> +
>> +virtio_of_op_set_config is used to set the config of a device for control queue only.
>> +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Config Command} with specified offset and bytes and value fields.
>> -- 
>> 2.25.1
>>

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding
       [not found]   ` <20230531210255.GC1509630@fedora>
@ 2023-06-02  9:07     ` zhenwei pi
  2023-06-05 16:57       ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  9:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/1/23 05:02, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:08PM +0800, zhenwei pi wrote:
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   transport-fabrics.tex | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>> index f563c3e..c47a744 100644
>> --- a/transport-fabrics.tex
>> +++ b/transport-fabrics.tex
>> @@ -873,3 +873,12 @@ \subsubsection{Status Definition}\label{sec:Virtio Transport Options / Virtio Ov
>>   #define VIRTIO_OF_EALREADY      114
>>   #define VIRTIO_OF_EQUIRK        4096
>>   \end{lstlisting}
>> +
>> +\subsection{Transport Binding}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transport Binding}
>> +\subsubsection{TCP}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / TCP}
>> +TCP MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
>> +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}.
>> +
>> +\subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / RDMA}
>> +RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
>> +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}.
> 
> What about VQN representation, default port numbers, etc? There should
> be enough information here so implementers can create compatible
> implementations.
> 

Already replied in '[PATCH v2 02/11] transport-fabrics: introduce Virtio 
Qualified Name'.

> Is there connection encryption support? It's hard to imagine running a
> plaintext Virtio Over Fabrics TCP connection in a production environment
> due to security concerns.
> 
> Stefan

As far as I can see, 1) an ACL mechanism could be used in the 
engineering implementation without any specification.(ex, a target only 
allows a specific IVQN). 2) authentication may be introduced in the future.

Does the virtqueue buffers need encryption support?

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] [PATCH v2 10/11] transport-fabrics: add device initialization
       [not found]   ` <20230531210925.GD1509630@fedora>
@ 2023-06-02  9:11     ` zhenwei pi
  0 siblings, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-06-02  9:11 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/1/23 05:09, Stefan Hajnoczi wrote:
> On Thu, May 04, 2023 at 04:19:09PM +0800, zhenwei pi wrote:
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
>>   transport-fabrics.tex | 24 ++++++++++++++++++++++++
>>   1 file changed, 24 insertions(+)
>>
>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>> index c47a744..af35622 100644
>> --- a/transport-fabrics.tex
>> +++ b/transport-fabrics.tex
>> @@ -882,3 +882,27 @@ \subsubsection{TCP}\label{sec:Virtio Transport Options / Virtio Over Fabrics / r
>>   \subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / RDMA}
>>   RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
>>   ~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}.
>> +
>> +\subsection{Device Initialization}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Device Initialization}
>> +\begin{enumerate}
>> +\item The control queue MUST be established firstly, once the reliable
> 
> It's not 100% clear whether the control queue must be established first
> and then the other things happen, or whether the other things have to
> happen in order to establish the control queue.
> 
> Here is one way to reword it:
> 
> "The control queue MUST be established first by connecting from the
> initiator to the target and sending a \nameref{sec:Virtio Transport
> Options / Virtio Over Fabrics / Transmission Protocol / Opcodes
> Definition / virtio_of_op_connect} command to create the association
> with the target."
> 
>> +connection is ready, the initiator MUST issue
>> +\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_connect}
>> +to create association with the target.
>> +\item The initiator SHOULD issue
>> +\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_feature}
>> +to discover the capabilities offered by the target.
>> +\item The initiator SHOULD issue
>> +\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_feature}
>> +to negotiate the capabilities.
>> +\item The initiator SHOULD continue initialization like PCI base devices, i.e. issue
> 
> Please use a \nameref to reference to something specific in the spec. It
> would be even better to list the exact steps here instead of referring
> to the PCI Transport.
> 
>> +\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_vendor_id}
>> +to get the vendor ID.
>> +\item After discovering the number of virtqueues by
>> +\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_num_queues},
>> +the initiator SHOULD create virtqueue one by one by
> 
> "virtqueues one by one with"
> 
>> +\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_connect}.
>> +\item The virtqueue SHOULD issue
> 
> "The initiator SHOULD issue"
> 
>> +\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_vring}
>> +to transmit buffer.
> 
> "to supply available buffers to the device"
> 

OK, will fix above flaws.

>> +\end{enumerate}
>> -- 
>> 2.25.1
>>
>>
>> This publicly archived list offers a means to provide input to the
>> OASIS Virtual I/O Device (VIRTIO) TC.
>>
>> In order to verify user consent to the Feedback License terms and
>> to minimize spam in the list archive, subscription is required
>> before posting.
>>
>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>> List help: virtio-comment-help@lists.oasis-open.org
>> List archive: https://lists.oasis-open.org/archives/virtio-comment/
>> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
>> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
>> Committee: https://www.oasis-open.org/committees/virtio/
>> Join OASIS: https://www.oasis-open.org/join/
>>

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview zhenwei pi
  2023-05-04  8:57   ` David Hildenbrand
  2023-05-31 14:00   ` [virtio-comment] " Stefan Hajnoczi
@ 2023-06-05  2:39   ` Parav Pandit
  2023-06-05  2:39   ` Parav Pandit
  3 siblings, 0 replies; 74+ messages in thread
From: Parav Pandit @ 2023-06-05  2:39 UTC (permalink / raw)
  To: virtio-comment



On 5/4/2023 4:19 AM, zhenwei pi wrote:
> In the past years, virtio supports lots of device specifications by
> PCI/MMIO/CCW. These devices work fine in the virtualization environment.
> 
> Introduce Virtio Over Fabrics transport to support "network defined
> peripheral devices". With this transport, Many Virtio based devices
> transparently work over fabrics. Note that the balloon device may not
> make sense. Shared memory regions won't work.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>   content.tex           |  1 +
>   transport-fabrics.tex | 31 +++++++++++++++++++++++++++++++
>   2 files changed, 32 insertions(+)
>   create mode 100644 transport-fabrics.tex
> 
> diff --git a/content.tex b/content.tex
> index cff548a..f899c3a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -582,6 +582,7 @@ \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>   \input{transport-pci.tex}
>   \input{transport-mmio.tex}
>   \input{transport-ccw.tex}
> +\input{transport-fabrics.tex}
>   
>   \chapter{Device Types}\label{sec:Device Types}
>   
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> new file mode 100644
> index 0000000..0dc031b
> --- /dev/null
> +++ b/transport-fabrics.tex
> @@ -0,0 +1,31 @@
> +\section{Virtio Over Fabrics}\label{sec:Virtio Transport Options / Virtio Over Fabrics}
> +
> +This section defines specification to Virtio that enables operation over other
"other" is contextual given the current spec definition.
And once this change is merged its not any "other" transport.

We can probably write it without referring to the PCI transport.

> +interconnects. A central goal of Virtio Over Fabrics is to maintain consistency
> +with the PCI device, so Virtio based devices transparently work over PCI or
> +fabrics.
> +

> +Virtio Over Fabrics uses reliable connection to transmit data, the reliable
> +connection betweens two rules:
> +
between
> +\begin{itemize}
> +\item An initiator functions as an Virtio Over Fabrics client. An initiator
> +typically serves the same purpose to a machine as a Virtio device, issues
> +commands to remote side.
> +\item A target functions as an Virtio Over Fabrics server. An target typically
> +handles commands from the initiator side and responses completions.
> +\end{itemize}
> +
> +Virtio Over Fabrics has the following differences from the PCI based
> +specification:
> +
> +\begin{itemize}
> +\item Instead of memory sharing mechanism of virtqueue, there is a one-to-one
> +mapping between virtqueue and the reliable connection which executes the vring
> +data transmission.
There is no concept of vring in virtio specification beyond 0.9.5 citations.
I guess to say "reliable connection that transports virtio command, 
status and associated data".

> +\item An additional control connection is required to execute control commands
> +which is similar to read/write register on a PCI device.
> +\item Virtio Over Fabrics does not define an interrupt mechanism that allows an
> +initiator to generate a host interrupt. It is the responsibility of the host
> +fabric interface to generate host interrupts.
> +\end{itemize}
In context of fabrics "host" to be replaced with "client" or "initiator".

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview zhenwei pi
                     ` (2 preceding siblings ...)
  2023-06-05  2:39   ` [virtio-comment] " Parav Pandit
@ 2023-06-05  2:39   ` Parav Pandit
  3 siblings, 0 replies; 74+ messages in thread
From: Parav Pandit @ 2023-06-05  2:39 UTC (permalink / raw)
  To: zhenwei pi, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong



On 5/4/2023 4:19 AM, zhenwei pi wrote:
> In the past years, virtio supports lots of device specifications by

> PCI/MMIO/CCW. These devices work fine in the virtualization environment.
> 
> Introduce Virtio Over Fabrics transport to support "network defined
s/network defined/network attached

> peripheral devices". With this transport, Many Virtio based devices
> transparently work over fabrics. 
I am not sure transparently.
Probably just better to say works over fabrics.

> Note that the balloon device may not
> make sense. Shared memory regions won't work.
> 

> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>   content.tex           |  1 +
>   transport-fabrics.tex | 31 +++++++++++++++++++++++++++++++
>   2 files changed, 32 insertions(+)
>   create mode 100644 transport-fabrics.tex
> 
> diff --git a/content.tex b/content.tex
> index cff548a..f899c3a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -582,6 +582,7 @@ \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>   \input{transport-pci.tex}
>   \input{transport-mmio.tex}
>   \input{transport-ccw.tex}
> +\input{transport-fabrics.tex}
>   
>   \chapter{Device Types}\label{sec:Device Types}
>   
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> new file mode 100644
> index 0000000..0dc031b
> --- /dev/null
> +++ b/transport-fabrics.tex
> @@ -0,0 +1,31 @@
> +\section{Virtio Over Fabrics}\label{sec:Virtio Transport Options / Virtio Over Fabrics}
> +
> +This section defines specification to Virtio that enables operation over other
> +interconnects. A central goal of Virtio Over Fabrics is to maintain consistency
We have 3 different terminology.
network based
fabrics
and here interconnects

best to drop and stick to fabrics.

> +with the PCI device, so Virtio based devices transparently work over PCI or
> +fabrics.
> +
> +Virtio Over Fabrics uses reliable connection to transmit data, the reliable
> +connection betweens two rules:
> +
s/betweens/between

"betweens two rules" is not reading right.

Virtio over fabrics uses underlying reliable transport to exchange data 
(as it is received also reliably).

because it may be multiple connections.


> +\begin{itemize}
> +\item An initiator functions as an Virtio Over Fabrics client. An initiator
> +typically serves the same purpose to a machine as a Virtio device, issues
> +commands to remote side.
> +\item A target functions as an Virtio Over Fabrics server. An target typically
> +handles commands from the initiator side and responses completions.
> +\end{itemize}
> +
> +Virtio Over Fabrics has the following differences from the PCI based
> +specification:
> +
> +\begin{itemize}
> +\item Instead of memory sharing mechanism of virtqueue, there is a one-to-one
> +mapping between virtqueue and the reliable connection which executes the vring
> +data transmission.
vring is not a well defined spec term today. It is mostly refers to 
legacy part of the spec.
So need to reword this.

> +\item An additional control connection is required to execute control commands
> +which is similar to read/write register on a PCI device.
> +\item Virtio Over Fabrics does not define an interrupt mechanism that allows an
> +initiator to generate a host interrupt. It is the responsibility of the host
> +fabric interface to generate host interrupts.
Please change to notifications as we try to keep the layer as much as 
possible between interrupt and notifications.

> +\end{itemize}

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name
  2023-06-02  1:50     ` zhenwei pi
@ 2023-06-05  2:40       ` Parav Pandit
  2023-06-05  7:57         ` zhenwei pi
  2023-06-05 17:05         ` Stefan Hajnoczi
  0 siblings, 2 replies; 74+ messages in thread
From: Parav Pandit @ 2023-06-05  2:40 UTC (permalink / raw)
  To: zhenwei pi, Stefan Hajnoczi
  Cc: mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/1/2023 9:50 PM, zhenwei pi wrote:
> 
> 
> On 5/31/23 22:06, Stefan Hajnoczi wrote:
>> On Thu, May 04, 2023 at 04:19:01PM +0800, zhenwei pi wrote:
>>> Add VQN section. The VQN is a little different from iSCSI/NVMe-oF on
>>> style limitation. Because iSCSI/NVMe-of is storage specific protocol,
>>> the full string IQN(for iSCSI/iSER) and NQN(for NVMe-oF) represents
>>> a "storage access address". However, Virtio Over Fabrics works as
>>> transport layer rather than device layer, a URL style string is better
>>> to Virtio Over Fabrics. For example:
>>> virtio-of://blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
>>> virtio-of://blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
>>> ...
>>> virtio-of://crypto-resource/25307f22-e5a8-4ea2-b7ca-79f5c3bebc3c
>>
>> I'm not sure what blk-resource and nvme-pool are in these URLs?
>>
>> Should the patch mention the virtio-of:// URI scheme?
>>
> 
> Sorry, I missed the address and port. They should be:
> virtio-rdma://192.168.1.100:8549/blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
> virtio-tcp://192.168.1.110/blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1

Since it is device specific resource, may be blk-dev or blk-device reads 
better, as behind this device there are multiple resources.

> ...
> 
> This is human readable string. when the software(or hardware) handles 
> this, this should be translated into:
> transport: RDMA
> address: 192.168.1.100
> port: 8549 (default port 8549(CRC-16/ARC of "Virtio"))
> target VQN: blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> 
> This section only defines the "VQN" schema, not the resource string schema.
> 
> For a process, I think the following two are both fine:
> ./foo --full-url 
> virtio-rdma://192.168.1.100:8549/blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
> ./foo --transport rdma --address 192.168.1.100 --port 8549 --tvqn
> blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> 
> [snip]
> 
>>
>> Is the maximum name 255 UTF-8 bytes plus a NUL character? Please state
>> this in the spec. For example:
>>
>>    \item The string is NUL terminated.
s/NUL/NULL ?

>>    \item The maximum name is 256 bytes in length, including the NUL 
>> character.
>>
> OK, fix this in the next version.
> 
>>> +\item There is no strict style limitation.
>>
>> I think it's necessary to define representations for specific fabrics
>> (e.g. TCP/IP) so that VQNs can be exchanged between different VIRTIO
>> implementations (VMMs, DPUs, command-line utilities, etc). Otherwise two
>> different implementations may represent the same address differently.
>>
>> Stefan
> 

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition zhenwei pi
  2023-05-31 14:23   ` Stefan Hajnoczi
@ 2023-06-05  2:40   ` Parav Pandit
  1 sibling, 0 replies; 74+ messages in thread
From: Parav Pandit @ 2023-06-05  2:40 UTC (permalink / raw)
  To: zhenwei pi, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong



On 5/4/2023 4:19 AM, zhenwei pi wrote:
> Introduce segment descriptor to describe the Virtio device buffer
> segments.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---
>   transport-fabrics.tex | 43 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 43 insertions(+)
> 
> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> index 26b0192..b88acfd 100644
> --- a/transport-fabrics.tex
> +++ b/transport-fabrics.tex
> @@ -45,3 +45,46 @@ \subsection{Virtio Qualified Name}\label{sec:Virtio Transport Options / Virtio O
>   \item The string is null terminated.
>   \item There is no strict style limitation.
>   \end{itemize}
> +
> +\subsection{Transmission Protocol}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol}
> +This section defines transmission protocol for Virtio Over Fabrics. All the
> +fields use little endian format.
> +
> +\subsubsection{Segment Descriptor Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Segment Descriptor Definition}
> +Virtio Over Fabrics uses the following structure to describe data segment:
> +
> +\begin{lstlisting}
> +struct virtio_of_vring_desc {
vring doesnt seem necessary here.

> +        le64 addr;
> +        le32 length;
> +        /* This marks the unique ID within a command, no limitation among inflight commands */
> +        le16 id;
id and key both seems redundant.
Not sure the need of id in each descriptor.
Yet to read full...

> +        /* This marks a buffer as keyed transmission (otherwise stream transmission) */
> +#define VIRTIO_OF_DESC_F_KEYED     1
> +        /* This marks a buffer as device write-only (otherwise device read-only). */
> +#define VIRTIO_OF_DESC_F_WRITE     2
> +        le16 flags;
> +        le32 key;
> +};
> +\end{lstlisting}
> +
> +The structure virtio_of_vring_desc is used for both keyed transmission
> +(i.e. RDMA) and stream transmission(i.e. TCP). The fields is described as follows:
> +
> +\begin{tabular}{ |l|l|l| }
> +\hline
> +Field & keyed transmission & stream transmission \\
> +\hline \hline
> +addr & Start address of remote memory buffer & Start address within the stream buffer \\
> +\hline
> +length & The length of remote memory buffer & The length of buffer within the stream \\
> +\hline
> +id & The ID of this descriptor & The ID of this descriptor \\
> +\hline
> +flags & both keyed transmission and stream transmission supported & stream transmission only \\
> +\hline
This is probably transport specific, don't see the need to transport 
this bit when the transport type is already known as keyed.

> +key & Key of the remote Memory Region & Ignore \\
> +\hline
> +\end{tabular}
> +
> +Depending on the opcode, a Command contains zero or more structure virtio_of_vring_desc.
Please move above opcode line patch to different patch where opcode is 
introduced.

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-05-04  8:19 ` [virtio-comment] [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission zhenwei pi
  2023-05-31 16:20   ` [virtio-comment] " Stefan Hajnoczi
@ 2023-06-05  2:41   ` Parav Pandit
  2023-06-05  8:41     ` zhenwei pi
  1 sibling, 1 reply; 74+ messages in thread
From: Parav Pandit @ 2023-06-05  2:41 UTC (permalink / raw)
  To: zhenwei pi, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong



On 5/4/2023 4:19 AM, zhenwei pi wrote:
> Keyed transmission is used for message oriented communication(Ex RDMA),
> also add virtio-blk read/write 8K example.
> 
> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> ---

> +An example of a virtio-blk write 8K request(message size: sizeof(Command) +
> +4 * sizeof(Descriptor)):
> +\begin{lstlisting}
> + COMMAND            +------+
> +                    |opcode|  ->  virtio_of_op_vring
> +                    +------+
> +                    |cmd id|  ->  10
> +                    +------+
> +                    |length|  ->  0
> +                    +------+
> +                    |ndesc |  ->  4
> +                    +------+
> +                    |rsvd  |
> +                    +------+
> +
> + DESC0              +------+
> +                    |addr  |  -> 0xffff012345670000
> +                    +------+
> +                    |length|  -> 16 (virtio blk write command)
> +                    +------+
> +                    |id    |  -> 0
> +                    +------+

for RDMA this id is not useful. It can be omitted.
still parsing the rest.

if we talk blk as an example, above command descriptor can be of 32 bytes,
such as
struct virtio_of_cmd {
	u8 opcode;
	u8 rsvd;
	le16 cmd_id;
	u8 inline_desc_cnt;
	u8 rsvd[3];
	/* some padding/metadata for long desc list if any */
};

struct virtio_of_rdma_desc {
	le64 addr;
	le32 length;
	le32 rdma_key;
};

struct virtio_rdma_op {
	struct virtio_of_cmd cmd;
	struct virtio_of_rdma_desc desc[1 or 3]; /* count can be negotiated */
};

With this a send and receive queue on initiator and target can exchange, 
cmd descriptor for read/writes.

RDMA allows mapping memory and also chaining it with next send.
This way, memory from 1B to 4GB can be represented using single rdma key 
for data DMA (read or write).

Completion is similarly 8B with status + cmd_id of constant size can be 
received in an RQ.

This is 1 RTT from initiator to target for cmd and response for whole 
4GB data transfer.
Depending on the data size, memory pressure, sharing, outstanding 
commands, etc target can read/write data from initiator memory using 
RDMA read/write addresses.

With this target can also implement poll or event.

RDMA writes do not guarantee data placement visibility in same order on 
the responder side as what is send on the requester side.

I ran out of time. Will review more later in the week.

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name
  2023-06-05  2:40       ` Parav Pandit
@ 2023-06-05  7:57         ` zhenwei pi
  2023-06-05 17:05         ` Stefan Hajnoczi
  1 sibling, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-06-05  7:57 UTC (permalink / raw)
  To: Parav Pandit, Stefan Hajnoczi
  Cc: mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/5/23 10:40, Parav Pandit wrote:
> 
> 
> On 6/1/2023 9:50 PM, zhenwei pi wrote:
>>
>>
>> On 5/31/23 22:06, Stefan Hajnoczi wrote:
>>> On Thu, May 04, 2023 at 04:19:01PM +0800, zhenwei pi wrote:
>>>> Add VQN section. The VQN is a little different from iSCSI/NVMe-oF on
>>>> style limitation. Because iSCSI/NVMe-of is storage specific protocol,
>>>> the full string IQN(for iSCSI/iSER) and NQN(for NVMe-oF) represents
>>>> a "storage access address". However, Virtio Over Fabrics works as
>>>> transport layer rather than device layer, a URL style string is better
>>>> to Virtio Over Fabrics. For example:
>>>> virtio-of://blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
>>>> virtio-of://blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
>>>> ...
>>>> virtio-of://crypto-resource/25307f22-e5a8-4ea2-b7ca-79f5c3bebc3c
>>>
>>> I'm not sure what blk-resource and nvme-pool are in these URLs?
>>>
>>> Should the patch mention the virtio-of:// URI scheme?
>>>
>>
>> Sorry, I missed the address and port. They should be:
>> virtio-rdma://192.168.1.100:8549/blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
>> virtio-tcp://192.168.1.110/blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> 
> Since it is device specific resource, may be blk-dev or blk-device reads 
> better, as behind this device there are multiple resources.
> 

OK.

>> ...
>>
>> This is human readable string. when the software(or hardware) handles 
>> this, this should be translated into:
>> transport: RDMA
>> address: 192.168.1.100
>> port: 8549 (default port 8549(CRC-16/ARC of "Virtio"))
>> target VQN: blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
>>
>> This section only defines the "VQN" schema, not the resource string 
>> schema.
>>
>> For a process, I think the following two are both fine:
>> ./foo --full-url 
>> virtio-rdma://192.168.1.100:8549/blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
>> ./foo --transport rdma --address 192.168.1.100 --port 8549 --tvqn
>> blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
>>
>> [snip]
>>
>>>
>>> Is the maximum name 255 UTF-8 bytes plus a NUL character? Please state
>>> this in the spec. For example:
>>>
>>>    \item The string is NUL terminated.
> s/NUL/NULL ?
> 

OK.

>>>    \item The maximum name is 256 bytes in length, including the NUL 
>>> character.
>>>
>> OK, fix this in the next version.
>>
>>>> +\item There is no strict style limitation.
>>>
>>> I think it's necessary to define representations for specific fabrics
>>> (e.g. TCP/IP) so that VQNs can be exchanged between different VIRTIO
>>> implementations (VMMs, DPUs, command-line utilities, etc). Otherwise two
>>> different implementations may represent the same address differently.
>>>
>>> Stefan
>>
> 
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: 
> https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
> 

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-05  2:41   ` Parav Pandit
@ 2023-06-05  8:41     ` zhenwei pi
  2023-06-05 11:45       ` Parav Pandit
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-05  8:41 UTC (permalink / raw)
  To: Parav Pandit, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong



On 6/5/23 10:41, Parav Pandit wrote:
> 
> 
> On 5/4/2023 4:19 AM, zhenwei pi wrote:
>> Keyed transmission is used for message oriented communication(Ex RDMA),
>> also add virtio-blk read/write 8K example.
>>
>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>> ---
> 
>> +An example of a virtio-blk write 8K request(message size: 
>> sizeof(Command) +
>> +4 * sizeof(Descriptor)):
>> +\begin{lstlisting}
>> + COMMAND            +------+
>> +                    |opcode|  ->  virtio_of_op_vring
>> +                    +------+
>> +                    |cmd id|  ->  10
>> +                    +------+
>> +                    |length|  ->  0
>> +                    +------+
>> +                    |ndesc |  ->  4
>> +                    +------+
>> +                    |rsvd  |
>> +                    +------+
>> +
>> + DESC0              +------+
>> +                    |addr  |  -> 0xffff012345670000
>> +                    +------+
>> +                    |length|  -> 16 (virtio blk write command)
>> +                    +------+
>> +                    |id    |  -> 0
>> +                    +------+
> 
> for RDMA this id is not useful. It can be omitted.
> still parsing the rest.
> 
> if we talk blk as an example, above command descriptor can be of 32 bytes,
> such as
> struct virtio_of_cmd {
>      u8 opcode;
>      u8 rsvd;
>      le16 cmd_id;
>      u8 inline_desc_cnt;
>      u8 rsvd[3];
>      /* some padding/metadata for long desc list if any */
> };
> 
> struct virtio_of_rdma_desc {
>      le64 addr;
>      le32 length;
>      le32 rdma_key;
> };
> 
> struct virtio_rdma_op {
>      struct virtio_of_cmd cmd;
>      struct virtio_of_rdma_desc desc[1 or 3]; /* count can be negotiated */
> };
> 
> With this a send and receive queue on initiator and target can exchange, 
> cmd descriptor for read/writes.
> 

Hi,

Do you mean that separating a Virtio Over RDMA queue into 2 QP, one for 
sending, another one for receiving?

> RDMA allows mapping memory and also chaining it with next send.
> This way, memory from 1B to 4GB can be represented using single rdma key 
> for data DMA (read or write).
> 
> Completion is similarly 8B with status + cmd_id of constant size can be 
> received in an RQ.
> 
> This is 1 RTT from initiator to target for cmd and response for whole 
> 4GB data transfer.
> Depending on the data size, memory pressure, sharing, outstanding 
> commands, etc target can read/write data from initiator memory using 
> RDMA read/write addresses.
> 
> With this target can also implement poll or event.
> 
> RDMA writes do not guarantee data placement visibility in same order on 
> the responder side as what is send on the requester side.
> 
> I ran out of time. Will review more later in the week.
> 
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: 
> https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
> 

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-05  8:41     ` zhenwei pi
@ 2023-06-05 11:45       ` Parav Pandit
  2023-06-05 12:50         ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Parav Pandit @ 2023-06-05 11:45 UTC (permalink / raw)
  To: zhenwei pi, mst, stefanha, jasowang
  Cc: virtio-comment, houp, helei.sig11, xinhao.kong



> From: zhenwei pi <pizhenwei@bytedance.com>
> Sent: Monday, June 5, 2023 4:41 AM
> 
> On 6/5/23 10:41, Parav Pandit wrote:
> >
> >
> > On 5/4/2023 4:19 AM, zhenwei pi wrote:
> >> Keyed transmission is used for message oriented communication(Ex
> >> RDMA), also add virtio-blk read/write 8K example.
> >>
> >> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> >> ---
> >
> >> +An example of a virtio-blk write 8K request(message size:
> >> sizeof(Command) +
> >> +4 * sizeof(Descriptor)):
> >> +\begin{lstlisting}
> >> + COMMAND            +------+
> >> +                    |opcode|  ->  virtio_of_op_vring
> >> +                    +------+
> >> +                    |cmd id|  ->  10
> >> +                    +------+
> >> +                    |length|  ->  0
> >> +                    +------+
> >> +                    |ndesc |  ->  4
> >> +                    +------+
> >> +                    |rsvd  |
> >> +                    +------+
> >> +
> >> + DESC0              +------+
> >> +                    |addr  |  -> 0xffff012345670000
> >> +                    +------+
> >> +                    |length|  -> 16 (virtio blk write command)
> >> +                    +------+
> >> +                    |id    |  -> 0
> >> +                    +------+
> >
> > for RDMA this id is not useful. It can be omitted.
> > still parsing the rest.
> >
> > if we talk blk as an example, above command descriptor can be of 32
> > bytes, such as struct virtio_of_cmd {
> >      u8 opcode;
> >      u8 rsvd;
> >      le16 cmd_id;
> >      u8 inline_desc_cnt;
> >      u8 rsvd[3];
> >      /* some padding/metadata for long desc list if any */ };
> >
> > struct virtio_of_rdma_desc {
> >      le64 addr;
> >      le32 length;
> >      le32 rdma_key;
> > };
> >
> > struct virtio_rdma_op {
> >      struct virtio_of_cmd cmd;
> >      struct virtio_of_rdma_desc desc[1 or 3]; /* count can be
> > negotiated */ };
> >
> > With this a send and receive queue on initiator and target can
> > exchange, cmd descriptor for read/writes.
> >
> 
> Hi,
> 
> Do you mean that separating a Virtio Over RDMA queue into 2 QP, one for
> sending, another one for receiving?
> 
No. just one QP.

Initiator_QP_A -> target_QP_B.

When initiator QP A sends 32B cmd, it lands in the target QP B's receive queue.

After this target can do one or more read/write DMA using RDMA read/write from the initiator's memory.

Finally target_QP_B sends 8B completion, it arrives in the QP_A's receive queue.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: RE: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-05 11:45       ` Parav Pandit
@ 2023-06-05 12:50         ` zhenwei pi
  2023-06-05 13:12           ` Parav Pandit
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-05 12:50 UTC (permalink / raw)
  To: Parav Pandit, stefanha
  Cc: mst, virtio-comment, houp, helei.sig11, xinhao.kong, jasowang

On 6/5/23 19:45, Parav Pandit wrote:
> 
> 
>> From: zhenwei pi <pizhenwei@bytedance.com>
>> Sent: Monday, June 5, 2023 4:41 AM
>>
>> On 6/5/23 10:41, Parav Pandit wrote:
>>>
>>>
>>> On 5/4/2023 4:19 AM, zhenwei pi wrote:
>>>> Keyed transmission is used for message oriented communication(Ex
>>>> RDMA), also add virtio-blk read/write 8K example.
>>>>
>>>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>>>> ---
>>>
>>>> +An example of a virtio-blk write 8K request(message size:
>>>> sizeof(Command) +
>>>> +4 * sizeof(Descriptor)):
>>>> +\begin{lstlisting}
>>>> + COMMAND            +------+
>>>> +                    |opcode|  ->  virtio_of_op_vring
>>>> +                    +------+
>>>> +                    |cmd id|  ->  10
>>>> +                    +------+
>>>> +                    |length|  ->  0
>>>> +                    +------+
>>>> +                    |ndesc |  ->  4
>>>> +                    +------+
>>>> +                    |rsvd  |
>>>> +                    +------+
>>>> +
>>>> + DESC0              +------+
>>>> +                    |addr  |  -> 0xffff012345670000
>>>> +                    +------+
>>>> +                    |length|  -> 16 (virtio blk write command)
>>>> +                    +------+
>>>> +                    |id    |  -> 0
>>>> +                    +------+
>>>
>>> for RDMA this id is not useful. It can be omitted.
>>> still parsing the rest.
>>>
>>> if we talk blk as an example, above command descriptor can be of 32
>>> bytes, such as struct virtio_of_cmd {
>>>       u8 opcode;
>>>       u8 rsvd;
>>>       le16 cmd_id;
>>>       u8 inline_desc_cnt;
>>>       u8 rsvd[3];
>>>       /* some padding/metadata for long desc list if any */ };
>>>
>>> struct virtio_of_rdma_desc {
>>>       le64 addr;
>>>       le32 length;
>>>       le32 rdma_key;
>>> };
>>>
>>> struct virtio_rdma_op {
>>>       struct virtio_of_cmd cmd;
>>>       struct virtio_of_rdma_desc desc[1 or 3]; /* count can be
>>> negotiated */ };
>>>
>>> With this a send and receive queue on initiator and target can
>>> exchange, cmd descriptor for read/writes.
>>>
>>
>> Hi,
>>
>> Do you mean that separating a Virtio Over RDMA queue into 2 QP, one for
>> sending, another one for receiving?
>>
> No. just one QP.
> 
> Initiator_QP_A -> target_QP_B.
> 
> When initiator QP A sends 32B cmd, it lands in the target QP B's receive queue.
> 
> After this target can do one or more read/write DMA using RDMA read/write from the initiator's memory.
> 

Hi, I have several questions:
1, how to tell the target to read/write DMA using RDMA read/write? is 
virtio_of_rdma_desc missing?

2, if several virtio_of_rdma_desc arrives, the target need to 
distinguish READ * m + WRITE * n descriptors. but *flags* field has been 
removed ...

3, if I understand correctly, Initiator_QP_A -> Target_QP_B(CMD), 
Target_QP_B(RDMA READ), Target_QP_B(RDMA WRITE), Target_QP_B -> 
Initiator_QP_A(COMP). this uses 4 RTT.

> Finally target_QP_B sends 8B completion, it arrives in the QP_A's receive queue.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: RE: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-05 12:50         ` zhenwei pi
@ 2023-06-05 13:12           ` Parav Pandit
  2023-06-06  7:13             ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Parav Pandit @ 2023-06-05 13:12 UTC (permalink / raw)
  To: zhenwei pi, stefanha
  Cc: mst, virtio-comment, houp, helei.sig11, xinhao.kong, jasowang



> From: zhenwei pi <pizhenwei@bytedance.com>
> Sent: Monday, June 5, 2023 8:50 AM


> >>> if we talk blk as an example, above command descriptor can be of 32
> >>> bytes, such as struct virtio_of_cmd {
> >>>       u8 opcode;
> >>>       u8 rsvd;
> >>>       le16 cmd_id;
> >>>       u8 inline_desc_cnt;
> >>>       u8 rsvd[3];
> >>>       /* some padding/metadata for long desc list if any */ };
> >>>
> >>> struct virtio_of_rdma_desc {
> >>>       le64 addr;
> >>>       le32 length;
> >>>       le32 rdma_key;
> >>> };
> >>>
> >>> struct virtio_rdma_op {
> >>>       struct virtio_of_cmd cmd;
> >>>       struct virtio_of_rdma_desc desc[1 or 3]; /* count can be
> >>> negotiated */ };
> >>>
> >>> With this a send and receive queue on initiator and target can
> >>> exchange, cmd descriptor for read/writes.
> >>>
> >>
> >> Hi,
> >>
> >> Do you mean that separating a Virtio Over RDMA queue into 2 QP, one
> >> for sending, another one for receiving?
> >>
> > No. just one QP.
> >
> > Initiator_QP_A -> target_QP_B.
> >
> > When initiator QP A sends 32B cmd, it lands in the target QP B's receive
> queue.
> >
> > After this target can do one or more read/write DMA using RDMA read/write
> from the initiator's memory.
> >
> 
> Hi, I have several questions:
> 1, how to tell the target to read/write DMA using RDMA read/write? is
> virtio_of_rdma_desc missing?
> 
Virtio_of_rdma_desc is part of the 32B struct virtio_rdma_op in above example.

> 2, if several virtio_of_rdma_desc arrives, the target need to distinguish READ *
> m + WRITE * n descriptors. but *flags* field has been removed ...
> 
The idea is to not have multiple virtio_of_rdma_desc.
An initiator can represent 1B to 4GB of noncontiguous buffer using a single rdma mkey.
Hence, only one virtio_of_rdma_desc is enough from initiator to target.

> 3, if I understand correctly, Initiator_QP_A -> Target_QP_B(CMD),
> Target_QP_B(RDMA READ), Target_QP_B(RDMA WRITE), Target_QP_B ->
> Initiator_QP_A(COMP). this uses 4 RTT.
> 
RDMA read and writes are for the actual variable size data of 512B, 4K, 1MB etc.

Optionally, a target can expose a constant size buffer where initiator can directly write the data of 512B, 4KB as well.
However, this doesn't scale very well always, but sure it is possible, and it only works for blk write commands.

In a more advanced scheme target can dynamically add such buffers and advertise it to the initiator.
I would think to make it incremental once the basic data flow model is established.

> > Finally target_QP_B sends 8B completion, it arrives in the QP_A's receive
> queue.
> 
> --
> zhenwei pi

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission
  2023-06-02  2:26     ` zhenwei pi
@ 2023-06-05 16:11       ` Stefan Hajnoczi
  2023-06-06  3:13         ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-05 16:11 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 2797 bytes --]

On Fri, Jun 02, 2023 at 10:26:48AM +0800, zhenwei pi wrote:
> On 5/31/23 23:20, Stefan Hajnoczi wrote:
> > On Thu, May 04, 2023 at 04:19:03PM +0800, zhenwei pi wrote:
> > > +                  | +------+
> > > +                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
> > > +                  | +------+
> > > +                  |
> > > + DATA             |>+------+  -> 0
> > > +                    |......|
> > > +                    +------+  -> 1
> > > +\end{lstlisting}
> > 
> > I think this is more flexible (and has more protocol overhead) than
> > necessary. When the device has used a virtqueue buffer, it indicates how
> > many bytes were used (this can be less than the totaly number of F_WRITE
> > bytes available). I don't think there is a need to communicate F_WRITE
> > descriptors, especially in the Completion. Just a Completion with a
> > 'length' field instead of an 'ndesc' field followed by data is enough.
> > 
> 
> I guest this is not enough. For example, a initiator want to read 3 desc:
> desc0[n bytes], desc1[m bytes], desc2[1 byte]. desc[2] is expected to read a
> u8 status.
> 
> the target fills desc0[n - x bytes], desc1[m - y bytes], desc2[1 byte], the
> 'length' is (n - x + m - y + 1), we should decode each descriptor and fill
> the driver buffer correctly.(otherwise, if x + y > 0, desc[2] is never
> filled)

No, the framing really doesn't matter - that's what the spec says, after
all. The framing could be [n, m, 1] like in your example or [1, 1, n-2,
m-1, 1, 1], both are valid. What matters is that the device knows at
which offset the 1-byte status field must be written.

It is the VIRTIO specification that determines how to find the offset,
not the framing of the virtqueue buffer elements. (Again, the spec
explicitly forbids depending on framing.)

In other words, the virtio-blk spec says that the status byte is the
last writeable byte and that's how the device knows the offset. The
framing doesn't matter.

> > Since VIRTIO has flexible framing
> > (https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-390004),
> > there isn't really a need to communicate the F_WRITE descriptors at all,
> > just the maximum number of used bytes that the initiator expects.
> > 
> > Can you explain why you chose to transmit F_WRITE descriptors in both
> > the Command and the Completion? Maybe I missed a reason why it's
> > important.
> 
> Just keep the flags same to the descriptor from the command, give the
> initiator a hint 'this is a read descriptor'.

Sending virtqueue element information across the wire seems inefficient
to me. I think the protocol can be optimized for stream (TCP) and keyed
(RDMA) fabrics by omitting aspects that are not strictly needed.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-02  5:15     ` [virtio-comment] " zhenwei pi
@ 2023-06-05 16:30       ` Stefan Hajnoczi
  2023-06-06  1:31         ` [virtio-comment] " zhenwei pi
  2023-06-06  2:02         ` [virtio-comment] " zhenwei pi
  0 siblings, 2 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-05 16:30 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 8030 bytes --]

On Fri, Jun 02, 2023 at 01:15:00PM +0800, zhenwei pi wrote:
> 
> 
> On 6/1/23 01:10, Stefan Hajnoczi wrote:
> > On Thu, May 04, 2023 at 04:19:05PM +0800, zhenwei pi wrote:
> > > Introduce command structures for Virtio-oF.
> > > 
> > > Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> > > ---
> > >   transport-fabrics.tex | 209 ++++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 209 insertions(+)
> > > 
> > > diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> > > index 7711321..37f57c6 100644
> > > --- a/transport-fabrics.tex
> > > +++ b/transport-fabrics.tex
> > > @@ -495,3 +495,212 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
> > >                       |value |  -> 8193 (value.u32)
> > >                       +------+
> > >   \end{lstlisting}
> > > +
> > > +\subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
> > > +This section defines command structures for Virtio Over Fabrics.
> > > +
> > > +A common structure virtio_of_value is fixed to 8 bytes and MUST be used as one
> > > +of the following format:
> > > +
> > > +\begin{itemize}
> > > +\item u8
> > > +\item le16
> > > +\item le32
> > > +\item le64
> > > +\end{itemize}
> > 
> > The way it's written does not document where the u8, u16, u32 bytes are
> > located and that the unused bytes are 0. I think I understand what you
> > mean though:
> > 
> >    le64 value = cpu_to_le64((u64)v); /* v is u8, u16, u32, or u64 */
> > 
> > Please clarify.
> > 
> 
> I want to describe an union structure of 8 bytes:
> union virtio_of_value {
>     u8;
>     u16;
>     u32;
>     u64;
> };
> 
> Depending on the opcode, use the right one.

I was trying to point out that the memory layout of C unions is not
portable. Your example does not define the exact in-memory layout of
union virtio_of_value. Here is the first web search result I found about
this topic:

  "Q: And a related question: if you dump unions in binary form to a file,
  and then reload them from the file on a different platform, or with a
  program compiled by a different compiler, are you guaranteed to get
  back what you stored? (I think not, but I'm not sure)

  A: You're right; you're not."

  https://bytes.com/topic/c/answers/220372-unions-storage-abis

In the cpu_to_le64() code example that I gave, the exact in-memory
layout is well-defined. There is no ambiguity.

> > > +\hline
> > > +0xff00 - 0xfffd & Reserved \\
> > > +\hline
> > > +\end{tabular}
> > > +
> > > +\paragraph{Connect Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
> > > +The Connect Command is used to establish Virtio Over Fabrics queue. The control
> > > +queue MUST be established firstly, then the Connect command establishes an
> > > +association between the initiator and the target.
> > 
> > Is a "Virtio Over Fabrics queue" different from a virtqueue?
> > 
> > If I understand correctly, the control queue must be established by the
> > initiator first and then the Connect command is sent to begin
> > communication between the initiator and the target?
> > 
> 
> The queue mapping is missing in the '[PATCH v2 01/11] transport-fabrics:
> introduce Virtio Over Fabrics overview', like:
> A "Virtio Over Fabrics queue" is a reliable connection between initiator and
> target. There are 2 types of Virtio Over Fabrics queue:
> +\begin{itemize}
> +\item A single Control queue is required to execute control operations.
> +\item 0 or more Virtio Over Fabrics queues map the virtqueues.
> +\end{itemize}

That helps, thanks!

> 
> > > +
> > > +The Target ID of 0xffff is reserved, then:
> > 
> > Please move this after the fields have been shown and the purpose of the
> > Target ID field has been explained.
> > 
> > > +\begin{itemize}
> > > +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
> > > +Command for the control queue.
> > > +\item The target SHOULD allocate any available Target ID to the initiator,
> > > +and return the allocated Target ID in the Completion.
> > > +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
> > > +MUST be specified in a Connect Command for the virtqueue.
> > > +\end{itemize}
> > 
> > What is the purpose of the Target ID? Is it to allow a server to provide
> > access to multiple targets over the same connection?
> > 
> 
> A target listens on a port, and provides access to 0 or more targets. An
> initiator connect the specific target by TVQN of connect command.
> An initiator could connect a single target, multiple initiators could
> connect the same target(typically, shared disk/fs).

Why is the target ID separate from the TVQN? If the Target ID is a
separate parameter then users will have to learn additional
syntax/command-line options to specify the TVQN + Target ID and that
syntax may vary between software.

> 
> > > +
> > > +The Connect Command has following structure:
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_of_command_connect {
> > > +        le16 opcode;
> > > +        le16 command_id;
> > > +        le16 target_id;
> > > +        le16 queue_id;
> > > +        le16 ndesc;
> > 
> > Where is this field documented?
> > 
> 
> OK. Will add.
> 
> > Why does the initiator send ndesc to the target? Normally a VIRTIO Transport reports the device's max descriptors and then the driver can tell the device to reduce the number of descriptors, if desired.
> > 
> 
> A target supports at lease 1 descriptor. The 'ndesc' of struct
> virtio_of_command_connect indicates the full PDU contains: struct
> virtio_of_command_connect + 1 * virtio_of_vq_desc + data.
> 
> > > +#define VIRTIO_OF_CONNECTION_TCP     1
> > > +#define VIRTIO_OF_CONNECTION_RDMA    2
> > 
> > What does RDMA mean? I thought RDMA is a general concept that several
> > fabrics implement (with different details like how addressing works).
> > 
> 
> I guest your concern is the difference of IB/RoCE/iWarp ...
> We are trying to define the payload protocol here, so I think we can ignore
> the difference of the HCA.

I see, maybe this could be called STREAM vs KEYED instead of TCP vs RDMA?

> 
> > > +        u8 oftype;
> > > +        u8 padding[5];
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +The Connect commands MUST contains one Segment Descriptor and one structure
> > > +virtio_of_command_connect to specify Initiator VQN and Target VNQ,
> > > +virtio_of_command_connect has following structure:
> > 
> > I'm confsued. virtio_of_command_connect was defined above. The struct
> > defined below is virtio_of_connect. Does this paragraph need to be
> > updated (virtio_of_command_connect -> virtio_of_connect)?
> > 
> > Why is virtio_of_connect a separate struct and not part of
> > virtio_of_command_connect?
> > 
> 
> Because I'd like to define all the commands with a fixed length.

I don't understand. virtio_of_connect and virtio_of_command_connect are
both fixed-length. Why can't they be unified into 1 fixed-length struct?

> > It's not possible to review this patch because these structs aren't used
> > yet and the opcodes are undefined.
> > 
> > Defining structs that are shared by multiple opcodes might make
> > implementations cleaner, but I think it makes the spec less clear. I
> > would rather have a list of all opcodes and each one shows the full
> > command layout (even if it is duplicated). That way it's very easy to
> > look up an opcode you are implementing or debugging and check what's
> > needed. If the command layout is not documented in a single place, then
> > it takes more effort to figure out how an opcode works.
> > 
> > Stefan
> 
> OK, I'll merge the structure definition into the opcode definition.

Thank you!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: [PATCH v2 07/11] transport-fabrics: introduce opcodes
  2023-06-02  8:39     ` [virtio-comment] " zhenwei pi
@ 2023-06-05 16:46       ` Stefan Hajnoczi
  0 siblings, 0 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-05 16:46 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 3791 bytes --]

On Fri, Jun 02, 2023 at 04:39:24PM +0800, zhenwei pi wrote:
> On 6/1/23 04:55, Stefan Hajnoczi wrote:
> > On Thu, May 04, 2023 at 04:19:06PM +0800, zhenwei pi wrote:
> > Does virtio_of_op_get_device_feature return the feature bits offered by
> > the device or does it update to reflect negotiated feature bits after
> > virtio_of_op_set_driver_feature?
> > 
> 
> virtio_of_op_get_device_feature returns the same feature bits after
> virtio_of_op_set_driver_feature. Because 1) the device feature is capability
> of device, 2) a target may be shared by multi initiators.
> 
> For now, I don't see any dependence on getting driver feature. Do you have
> any concern about this?

No, that sounds good. I just want the semantics to be clearly defined
because VIRTIO Transports differ in whether the driver can read back
feature bits after negotiation. Doing so is not necessary because the
Device Status Field already indicates whether or not feature bit
negotiation was successful.

> 
> > > +
> > > +\paragraph{virtio_of_op_set_driver_feature}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_set_driver_feature}
> > > +
> > > +virtio_of_op_set_driver_feature is used to set driver feature for control queue only.
> > > +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Feature Command},
> > > +and specify the value field of Common Command as le64.
> > > +
> > > +The initiator uses feature_select field to select which feature bits to set.
> > > +Value 0x0 selects Feature Bits 0 to 63, 0x1 selects Feature Bits 64 to 128, etc.
> > > +
> > > +\paragraph{virtio_of_op_get_num_queues}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_num_queues}
> > > +
> > > +virtio_of_op_get_num_queues is used to get the number of queues for control queue only.
> > > +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Common Command},
> > > +and reads from value field of Completion as le16.
> > > +
> > > +\paragraph{virtio_of_op_get_queue_size}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Opcodes Definition / virtio_of_op_get_queue_size}
> > > +
> > > +virtio_of_op_get_queue_size is used to get the size of a specified queue for control queue only.
> > > +The initiator MUST issue a \nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Queue Command} with specified queue_id,
> > > +and reads from value field of Completion as le16.
> > 
> > Is it possible to set the queue size? For example, the PCI Transport
> > allows the driver to lower the queue size but not increase it (see
> > 4.1.5.1.3 Virtqueue Configuration).
> > 
> 
> Agree. Because a target may be shared by multi initiators, it's not
> reasonable to set queue size of target, the queue size only affect this
> initiator itself.
> For example, a target supports queue size 1024. initiatorX uses 128 queue
> size, and initiatorY uses 1024. Do you have any suggestion about this?

I assumed that there is a 1:1 mapping between VIRTIO Over Fabrics
Targets (TVQN + Target ID) and VIRTIO Devices. I expected initiatorY's
Connect Command to be rejected by the target when initiatorX is already
connected. Therefore there is no conflict between two initiators
choosing different queue sizes.

Anyway, I see no issue with allowing the initiator to reduce the queue
size. This allows the target to allocate fewer resources to the device
until the next device reset.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding
  2023-06-02  9:07     ` [virtio-comment] Re: " zhenwei pi
@ 2023-06-05 16:57       ` Stefan Hajnoczi
  2023-06-06  1:41         ` [virtio-comment] " zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-05 16:57 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 3010 bytes --]

On Fri, Jun 02, 2023 at 05:07:14PM +0800, zhenwei pi wrote:
> 
> 
> On 6/1/23 05:02, Stefan Hajnoczi wrote:
> > On Thu, May 04, 2023 at 04:19:08PM +0800, zhenwei pi wrote:
> > > Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> > > ---
> > >   transport-fabrics.tex | 9 +++++++++
> > >   1 file changed, 9 insertions(+)
> > > 
> > > diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> > > index f563c3e..c47a744 100644
> > > --- a/transport-fabrics.tex
> > > +++ b/transport-fabrics.tex
> > > @@ -873,3 +873,12 @@ \subsubsection{Status Definition}\label{sec:Virtio Transport Options / Virtio Ov
> > >   #define VIRTIO_OF_EALREADY      114
> > >   #define VIRTIO_OF_EQUIRK        4096
> > >   \end{lstlisting}
> > > +
> > > +\subsection{Transport Binding}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transport Binding}
> > > +\subsubsection{TCP}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / TCP}
> > > +TCP MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
> > > +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}.
> > > +
> > > +\subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / RDMA}
> > > +RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
> > > +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}.
> > 
> > What about VQN representation, default port numbers, etc? There should
> > be enough information here so implementers can create compatible
> > implementations.
> > 
> 
> Already replied in '[PATCH v2 02/11] transport-fabrics: introduce Virtio
> Qualified Name'.
> 
> > Is there connection encryption support? It's hard to imagine running a
> > plaintext Virtio Over Fabrics TCP connection in a production environment
> > due to security concerns.
> > 
> > Stefan
> 
> As far as I can see, 1) an ACL mechanism could be used in the engineering
> implementation without any specification.(ex, a target only allows a
> specific IVQN). 2) authentication may be introduced in the future.
> 
> Does the virtqueue buffers need encryption support?

An ACL in the target is still susceptible to attacks on confidentiality
(spying on traffic) and integrity (spoofing, injecting, or corrupting
traffic).

My view is that nowadays anything that goes over the network needs
Transport Layer Security (TLS) built in or something comparable unless
the use cases are clearly limited to scenarios where this is not
necessary. To me it seems like Virtio over Fabarics could be used in
scenarios where encryption is necessary (e.g. to protect user data being
sent over a network).

NVMe-over-TCP supports TLS.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name
  2023-06-05  2:40       ` Parav Pandit
  2023-06-05  7:57         ` zhenwei pi
@ 2023-06-05 17:05         ` Stefan Hajnoczi
  1 sibling, 0 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-05 17:05 UTC (permalink / raw)
  To: Parav Pandit
  Cc: zhenwei pi, mst, jasowang, virtio-comment, houp, helei.sig11,
	xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 2717 bytes --]

On Sun, Jun 04, 2023 at 10:40:06PM -0400, Parav Pandit wrote:
> 
> 
> On 6/1/2023 9:50 PM, zhenwei pi wrote:
> > 
> > 
> > On 5/31/23 22:06, Stefan Hajnoczi wrote:
> > > On Thu, May 04, 2023 at 04:19:01PM +0800, zhenwei pi wrote:
> > > > Add VQN section. The VQN is a little different from iSCSI/NVMe-oF on
> > > > style limitation. Because iSCSI/NVMe-of is storage specific protocol,
> > > > the full string IQN(for iSCSI/iSER) and NQN(for NVMe-oF) represents
> > > > a "storage access address". However, Virtio Over Fabrics works as
> > > > transport layer rather than device layer, a URL style string is better
> > > > to Virtio Over Fabrics. For example:
> > > > virtio-of://blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
> > > > virtio-of://blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> > > > ...
> > > > virtio-of://crypto-resource/25307f22-e5a8-4ea2-b7ca-79f5c3bebc3c
> > > 
> > > I'm not sure what blk-resource and nvme-pool are in these URLs?
> > > 
> > > Should the patch mention the virtio-of:// URI scheme?
> > > 
> > 
> > Sorry, I missed the address and port. They should be:
> > virtio-rdma://192.168.1.100:8549/blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
> > virtio-tcp://192.168.1.110/blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> 
> Since it is device specific resource, may be blk-dev or blk-device reads
> better, as behind this device there are multiple resources.
> 
> > ...
> > 
> > This is human readable string. when the software(or hardware) handles
> > this, this should be translated into:
> > transport: RDMA
> > address: 192.168.1.100
> > port: 8549 (default port 8549(CRC-16/ARC of "Virtio"))
> > target VQN: blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> > 
> > This section only defines the "VQN" schema, not the resource string schema.
> > 
> > For a process, I think the following two are both fine:
> > ./foo --full-url virtio-rdma://192.168.1.100:8549/blk-resource/nvme-pool/849a39ad-8d7b-4a7a-adb6-e7407ace532c
> > ./foo --transport rdma --address 192.168.1.100 --port 8549 --tvqn
> > blk-resource/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> > 
> > [snip]
> > 
> > > 
> > > Is the maximum name 255 UTF-8 bytes plus a NUL character? Please state
> > > this in the spec. For example:
> > > 
> > >    \item The string is NUL terminated.
> s/NUL/NULL ?

I like to use the ASCII "NUL" character name because that avoids
confusion with other concepts of nullness in programming:

  "It is often abbreviated as NUL (or NULL, though in some contexts that
  term is used for the null pointer)"

https://en.wikipedia.org/wiki/NUL_character

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-02  0:55               ` zhenwei pi
@ 2023-06-05 17:21                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-05 17:21 UTC (permalink / raw)
  To: zhenwei pi; +Cc: virtio-comment

[-- Attachment #1: Type: text/plain, Size: 4379 bytes --]

On Fri, Jun 02, 2023 at 08:55:02AM +0800, zhenwei pi wrote:
> 
> 
> On 6/2/23 05:23, Stefan Hajnoczi wrote:
> > On Thu, Jun 01, 2023 at 03:13:53PM -0400, Stefan Hajnoczi wrote:
> > > On Thu, Jun 01, 2023 at 09:09:49PM +0800, zhenwei pi wrote:
> > > > On 6/1/23 19:33, Stefan Hajnoczi wrote:
> > > > > On Thu, Jun 01, 2023 at 05:02:45PM +0800, zhenwei pi wrote:
> > > > > > On 6/1/23 00:20, Stefan Hajnoczi wrote:
> > > > > > > On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
> > > One more idea to play with: VIRTIO has flexible message framing, so
> > > devices must process a virtqueue buffer the same regardless of whether
> > > it has 1 large element or many small elements. Therefore the virtqueue
> > > RDMA protocol does not need to preserve the virtqueue element count and
> > > sizes from the driver. For example, the target can offer a list of
> > > key/length pairs that the initiator RDMA WRITES the virtqueue buffer
> > > contents into. For a virtio-blk device that would be a struct
> > > virtio_blk_outhdr followed by a large page-aligned buffer for the I/O
> > > buffer data to be transferred. Then the device always a properly aligned
> > > and contiguous buffer. Unfortunately this approach breaks down when the
> > > virtqueue carries requests that are organized very differently, but it
> > > might be useful when there is a most common request type.
> > 
> > I'm not sure if I explained this well. What I'm trying to say is that I
> > think RDMA benefits when the receiver's memory constraints are visible
> > to the sender. The sender performs RDMA WRITEs to the locations where
> > the receiver can efficiently process the data.
> > 
> > This protocol proposal doesn't really take advantage of this approach
> > because it communicates the virtqueue buffer elements from the initiator
> > (the sender) to the target (the receiver). That's the wrong way around.
> > 
> > I have never used RDMA myself, so this might be wrong, but as long as
> > the RDMA API allows the sender to specify a scatter-gather list as
> > input, then I think the details of the virtqueue buffer elements that
> > don't have the WRITE flag should never be communicated over the network.
> > Instead the initiator should RDMA WRITE from the VIRTIO driver's
> > scatter-gather list to the target's preferred destination instead.
> > 
> > Stefan
> 
> Hi,
> 
> I guess I followed your point. "the target can offer a list of key/length
> pairs that the initiator RDMA WRITES the virtqueue buffer contents into"
> seems not good to me, I'd prefer to expose RDMA memory region of initiator
> side only(target side uses RDMA READ/WRITE to operate the memory of
> initiator, this means target side has no need to allocate/pin memory
> buffer).

Many targets will need to pin memory for the underlying disk I/O anyway.
If the initiator RDMA WRITEs data into the target's pinned memory, then
the target can forward the data to the disk without copies.

But assuming the target doesn't want to pin memory, the protocol can
still be simplified. The initiator sends a VQ_OP command containing:
1. VQ_OP header with a list of <addr, key, len> tuples for WRITE
   virtqueue buffer elements.
2. The contents of the !WRITE virtqueue buffer elements.

Note that this approach does not involve the target sending RDMA READs
because this seems inefficient to me when the ibv_*() APIs allow the
initiator to send the !WRITE virtqueue buffer elements along with the
requests using a scatter-gather list.

The target receives the VQ_OP command and sends RDMA WRITEs to fill in
used buffer elements. The last RDMA WRITEs may need to be WRITE WITH IMM
to efficiently complete the request.

> From the point of my view, this protocol needs to be effective and
> maintainable, mapping vring mechanism with RDMA WRITE from 2
> directions(initiator to target, and target to initiator) leads high
> complexity ...

My concern is that simply mapping vrings to RDMA is inefficient. It is
not necessary for the target to RDMA READ virtqueue buffer elements when
the initiator could include them in its send scatter-gather list
instead.

If we forget about vrings and focus instead on how to offer virtqueue
semantics at the minimal RDMA cost, then I think the protocol would look
more like what I'm describing.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-05 16:30       ` Stefan Hajnoczi
@ 2023-06-06  1:31         ` zhenwei pi
  2023-06-06 13:34           ` Stefan Hajnoczi
  2023-06-06  2:02         ` [virtio-comment] " zhenwei pi
  1 sibling, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-06  1:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

On 6/6/23 00:30, Stefan Hajnoczi wrote:
> On Fri, Jun 02, 2023 at 01:15:00PM +0800, zhenwei pi wrote:
>>
>>
>> On 6/1/23 01:10, Stefan Hajnoczi wrote:
>>> On Thu, May 04, 2023 at 04:19:05PM +0800, zhenwei pi wrote:
>>>> Introduce command structures for Virtio-oF.
>>>>
>>>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>>>> ---
>>>>    transport-fabrics.tex | 209 ++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 209 insertions(+)
>>>>
>>>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>>>> index 7711321..37f57c6 100644
>>>> --- a/transport-fabrics.tex
>>>> +++ b/transport-fabrics.tex
>>>> @@ -495,3 +495,212 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
>>>>                        |value |  -> 8193 (value.u32)
>>>>                        +------+
>>>>    \end{lstlisting}
>>>> +
>>>> +\subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
>>>> +This section defines command structures for Virtio Over Fabrics.
>>>> +
>>>> +A common structure virtio_of_value is fixed to 8 bytes and MUST be used as one
>>>> +of the following format:
>>>> +
>>>> +\begin{itemize}
>>>> +\item u8
>>>> +\item le16
>>>> +\item le32
>>>> +\item le64
>>>> +\end{itemize}
>>>
>>> The way it's written does not document where the u8, u16, u32 bytes are
>>> located and that the unused bytes are 0. I think I understand what you
>>> mean though:
>>>
>>>     le64 value = cpu_to_le64((u64)v); /* v is u8, u16, u32, or u64 */
>>>
>>> Please clarify.
>>>
>>
>> I want to describe an union structure of 8 bytes:
>> union virtio_of_value {
>>      u8;
>>      u16;
>>      u32;
>>      u64;
>> };
>>
>> Depending on the opcode, use the right one.
> 
> I was trying to point out that the memory layout of C unions is not
> portable. Your example does not define the exact in-memory layout of
> union virtio_of_value. Here is the first web search result I found about
> this topic:
> 
>    "Q: And a related question: if you dump unions in binary form to a file,
>    and then reload them from the file on a different platform, or with a
>    program compiled by a different compiler, are you guaranteed to get
>    back what you stored? (I think not, but I'm not sure)
> 
>    A: You're right; you're not."
> 
>    https://bytes.com/topic/c/answers/220372-unions-storage-abis
> 
> In the cpu_to_le64() code example that I gave, the exact in-memory
> layout is well-defined. There is no ambiguity.
> 

OK, thanks.

>>>> +\hline
>>>> +0xff00 - 0xfffd & Reserved \\
>>>> +\hline
>>>> +\end{tabular}
>>>> +
>>>> +\paragraph{Connect Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
>>>> +The Connect Command is used to establish Virtio Over Fabrics queue. The control
>>>> +queue MUST be established firstly, then the Connect command establishes an
>>>> +association between the initiator and the target.
>>>
>>> Is a "Virtio Over Fabrics queue" different from a virtqueue?
>>>
>>> If I understand correctly, the control queue must be established by the
>>> initiator first and then the Connect command is sent to begin
>>> communication between the initiator and the target?
>>>
>>
>> The queue mapping is missing in the '[PATCH v2 01/11] transport-fabrics:
>> introduce Virtio Over Fabrics overview', like:
>> A "Virtio Over Fabrics queue" is a reliable connection between initiator and
>> target. There are 2 types of Virtio Over Fabrics queue:
>> +\begin{itemize}
>> +\item A single Control queue is required to execute control operations.
>> +\item 0 or more Virtio Over Fabrics queues map the virtqueues.
>> +\end{itemize}
> 
> That helps, thanks!
> 
>>
>>>> +
>>>> +The Target ID of 0xffff is reserved, then:
>>>
>>> Please move this after the fields have been shown and the purpose of the
>>> Target ID field has been explained.
>>>
>>>> +\begin{itemize}
>>>> +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
>>>> +Command for the control queue.
>>>> +\item The target SHOULD allocate any available Target ID to the initiator,
>>>> +and return the allocated Target ID in the Completion.
>>>> +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
>>>> +MUST be specified in a Connect Command for the virtqueue.
>>>> +\end{itemize}
>>>
>>> What is the purpose of the Target ID? Is it to allow a server to provide
>>> access to multiple targets over the same connection?
>>>
>>
>> A target listens on a port, and provides access to 0 or more targets. An
>> initiator connect the specific target by TVQN of connect command.
>> An initiator could connect a single target, multiple initiators could
>> connect the same target(typically, shared disk/fs).
> 
> Why is the target ID separate from the TVQN? If the Target ID is a
> separate parameter then users will have to learn additional
> syntax/command-line options to specify the TVQN + Target ID and that
> syntax may vary between software.
> 
>>
>>>> +
>>>> +The Connect Command has following structure:
>>>> +
>>>> +\begin{lstlisting}
>>>> +struct virtio_of_command_connect {
>>>> +        le16 opcode;
>>>> +        le16 command_id;
>>>> +        le16 target_id;
>>>> +        le16 queue_id;
>>>> +        le16 ndesc;
>>>
>>> Where is this field documented?
>>>
>>
>> OK. Will add.
>>
>>> Why does the initiator send ndesc to the target? Normally a VIRTIO Transport reports the device's max descriptors and then the driver can tell the device to reduce the number of descriptors, if desired.
>>>
>>
>> A target supports at lease 1 descriptor. The 'ndesc' of struct
>> virtio_of_command_connect indicates the full PDU contains: struct
>> virtio_of_command_connect + 1 * virtio_of_vq_desc + data.
>>
>>>> +#define VIRTIO_OF_CONNECTION_TCP     1
>>>> +#define VIRTIO_OF_CONNECTION_RDMA    2
>>>
>>> What does RDMA mean? I thought RDMA is a general concept that several
>>> fabrics implement (with different details like how addressing works).
>>>
>>
>> I guest your concern is the difference of IB/RoCE/iWarp ...
>> We are trying to define the payload protocol here, so I think we can ignore
>> the difference of the HCA.
> 
> I see, maybe this could be called STREAM vs KEYED instead of TCP vs RDMA?
> 

I'd like to define two PDU mapping rules(in '[PATCH v2 04/11] 
transport-fabrics: introduce Stream Transmission' and '[PATCH v2 05/11] 
transport-fabrics: introduce Keyed Transmission'): STREAM and KEYED. A 
transport protocols need to use one.

Then we can define protocols:
#define VIRTIO_OF_CONNECTION_TCP     1		-> use STREAM
#define VIRTIO_OF_CONNECTION_RDMA    2		-> use KEYED
#define VIRTIO_OF_CONNECTION_TLS     3(in the future)	-> use STREAM
#define VIRTIO_OF_CONNECTION_XXX

>>
>>>> +        u8 oftype;
>>>> +        u8 padding[5];
>>>> +};
>>>> +\end{lstlisting}
>>>> +
>>>> +The Connect commands MUST contains one Segment Descriptor and one structure
>>>> +virtio_of_command_connect to specify Initiator VQN and Target VNQ,
>>>> +virtio_of_command_connect has following structure:
>>>
>>> I'm confsued. virtio_of_command_connect was defined above. The struct
>>> defined below is virtio_of_connect. Does this paragraph need to be
>>> updated (virtio_of_command_connect -> virtio_of_connect)?
>>>
>>> Why is virtio_of_connect a separate struct and not part of
>>> virtio_of_command_connect?
>>>
>>
>> Because I'd like to define all the commands with a fixed length.
> 
> I don't understand. virtio_of_connect and virtio_of_command_connect are
> both fixed-length. Why can't they be unified into 1 fixed-length struct?
> 

For stream protocol, it always work fine.
For keyed protocol, for example RDMA, the target side needs to use 
ibv_post_recv to receive a large size(sizeof virtio_of_command_connect + 
sizeof virtio_of_connect). If the target uses ibv_post_recv to receive 
sizeof(CMD) + sizeof(DESC) * 1, the initiator fails in RDMA SEND.

>>> It's not possible to review this patch because these structs aren't used
>>> yet and the opcodes are undefined.
>>>
>>> Defining structs that are shared by multiple opcodes might make
>>> implementations cleaner, but I think it makes the spec less clear. I
>>> would rather have a list of all opcodes and each one shows the full
>>> command layout (even if it is duplicated). That way it's very easy to
>>> look up an opcode you are implementing or debugging and check what's
>>> needed. If the command layout is not documented in a single place, then
>>> it takes more effort to figure out how an opcode works.
>>>
>>> Stefan
>>
>> OK, I'll merge the structure definition into the opcode definition.
> 
> Thank you!
> 
> Stefan

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: Re: [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding
  2023-06-05 16:57       ` Stefan Hajnoczi
@ 2023-06-06  1:41         ` zhenwei pi
  2023-06-06 13:51           ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-06  1:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

On 6/6/23 00:57, Stefan Hajnoczi wrote:
> On Fri, Jun 02, 2023 at 05:07:14PM +0800, zhenwei pi wrote:
>>
>>
>> On 6/1/23 05:02, Stefan Hajnoczi wrote:
>>> On Thu, May 04, 2023 at 04:19:08PM +0800, zhenwei pi wrote:
>>>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>>>> ---
>>>>    transport-fabrics.tex | 9 +++++++++
>>>>    1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>>>> index f563c3e..c47a744 100644
>>>> --- a/transport-fabrics.tex
>>>> +++ b/transport-fabrics.tex
>>>> @@ -873,3 +873,12 @@ \subsubsection{Status Definition}\label{sec:Virtio Transport Options / Virtio Ov
>>>>    #define VIRTIO_OF_EALREADY      114
>>>>    #define VIRTIO_OF_EQUIRK        4096
>>>>    \end{lstlisting}
>>>> +
>>>> +\subsection{Transport Binding}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transport Binding}
>>>> +\subsubsection{TCP}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / TCP}
>>>> +TCP MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
>>>> +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}.
>>>> +
>>>> +\subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / RDMA}
>>>> +RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
>>>> +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}.
>>>
>>> What about VQN representation, default port numbers, etc? There should
>>> be enough information here so implementers can create compatible
>>> implementations.
>>>
>>
>> Already replied in '[PATCH v2 02/11] transport-fabrics: introduce Virtio
>> Qualified Name'.
>>
>>> Is there connection encryption support? It's hard to imagine running a
>>> plaintext Virtio Over Fabrics TCP connection in a production environment
>>> due to security concerns.
>>>
>>> Stefan
>>
>> As far as I can see, 1) an ACL mechanism could be used in the engineering
>> implementation without any specification.(ex, a target only allows a
>> specific IVQN). 2) authentication may be introduced in the future.
>>
>> Does the virtqueue buffers need encryption support?
> 
> An ACL in the target is still susceptible to attacks on confidentiality
> (spying on traffic) and integrity (spoofing, injecting, or corrupting
> traffic).
> 
> My view is that nowadays anything that goes over the network needs
> Transport Layer Security (TLS) built in or something comparable unless
> the use cases are clearly limited to scenarios where this is not
> necessary. To me it seems like Virtio over Fabarics could be used in
> scenarios where encryption is necessary (e.g. to protect user data being
> sent over a network).
> 
> NVMe-over-TCP supports TLS.
> 
> Stefan

Generally, LAN is considered to be secure, using TCP makes sense. TLS is 
needed for WAN.

I prefer to support 2 different transports NVMe-over-TCP and 
NVMe-over-TLS, both use STREAM.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-05 16:30       ` Stefan Hajnoczi
  2023-06-06  1:31         ` [virtio-comment] " zhenwei pi
@ 2023-06-06  2:02         ` zhenwei pi
  2023-06-06 13:44           ` Stefan Hajnoczi
  1 sibling, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-06  2:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/6/23 00:30, Stefan Hajnoczi wrote:
[snip]
>>
>>>> +
>>>> +The Target ID of 0xffff is reserved, then:
>>>
>>> Please move this after the fields have been shown and the purpose of the
>>> Target ID field has been explained.
>>>
>>>> +\begin{itemize}
>>>> +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
>>>> +Command for the control queue.
>>>> +\item The target SHOULD allocate any available Target ID to the initiator,
>>>> +and return the allocated Target ID in the Completion.
>>>> +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
>>>> +MUST be specified in a Connect Command for the virtqueue.
>>>> +\end{itemize}
>>>
>>> What is the purpose of the Target ID? Is it to allow a server to provide
>>> access to multiple targets over the same connection?
>>>
>>
>> A target listens on a port, and provides access to 0 or more targets. An
>> initiator connect the specific target by TVQN of connect command.
>> An initiator could connect a single target, multiple initiators could
>> connect the same target(typically, shared disk/fs).
> 
> Why is the target ID separate from the TVQN? If the Target ID is a
> separate parameter then users will have to learn additional
> syntax/command-line options to specify the TVQN + Target ID and that
> syntax may vary between software.
> 

The TVQN is the location of a target, for example:
virtio-tcp://192.168.1.110:8549/blk-dev/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1

A target can be shared by multi initiators, they accesses the target by 
the same address(transport: tcp, ip: 192.168.1.110, port: 8549, TVQN: 
blk-dev/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1):

Initiator_A launches Control Queue, issues connect command, gets the 
Target ID of Target_A(typically, dynamically allocated by target), then 
virtqueues connect to Target_A.

Initiator_B launches Control Queue, issues connect command, gets the 
Target ID of Target_B(typically, dynamically allocated by target), then 
virtqueues connect to Target_B.

...

Once an initiator disconnects, the target should reclaim the Target ID.

[snip]

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: Re: [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission
  2023-06-05 16:11       ` Stefan Hajnoczi
@ 2023-06-06  3:13         ` zhenwei pi
  2023-06-06 13:09           ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-06  3:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

On 6/6/23 00:11, Stefan Hajnoczi wrote:
> On Fri, Jun 02, 2023 at 10:26:48AM +0800, zhenwei pi wrote:
>> On 5/31/23 23:20, Stefan Hajnoczi wrote:
>>> On Thu, May 04, 2023 at 04:19:03PM +0800, zhenwei pi wrote:
>>>> +                  | +------+
>>>> +                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
>>>> +                  | +------+
>>>> +                  |
>>>> + DATA             |>+------+  -> 0
>>>> +                    |......|
>>>> +                    +------+  -> 1
>>>> +\end{lstlisting}
>>>
>>> I think this is more flexible (and has more protocol overhead) than
>>> necessary. When the device has used a virtqueue buffer, it indicates how
>>> many bytes were used (this can be less than the totaly number of F_WRITE
>>> bytes available). I don't think there is a need to communicate F_WRITE
>>> descriptors, especially in the Completion. Just a Completion with a
>>> 'length' field instead of an 'ndesc' field followed by data is enough.
>>>
>>
>> I guest this is not enough. For example, a initiator want to read 3 desc:
>> desc0[n bytes], desc1[m bytes], desc2[1 byte]. desc[2] is expected to read a
>> u8 status.
>>
>> the target fills desc0[n - x bytes], desc1[m - y bytes], desc2[1 byte], the
>> 'length' is (n - x + m - y + 1), we should decode each descriptor and fill
>> the driver buffer correctly.(otherwise, if x + y > 0, desc[2] is never
>> filled)
> 
> No, the framing really doesn't matter - that's what the spec says, after
> all. The framing could be [n, m, 1] like in your example or [1, 1, n-2,
> m-1, 1, 1], both are valid. What matters is that the device knows at
> which offset the 1-byte status field must be written.
> 
> It is the VIRTIO specification that determines how to find the offset,
> not the framing of the virtqueue buffer elements. (Again, the spec
> explicitly forbids depending on framing.)
> 
> In other words, the virtio-blk spec says that the status byte is the
> last writeable byte and that's how the device knows the offset. The
> framing doesn't matter.
> 
>>> Since VIRTIO has flexible framing
>>> (https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-390004),
>>> there isn't really a need to communicate the F_WRITE descriptors at all,
>>> just the maximum number of used bytes that the initiator expects.
>>>
>>> Can you explain why you chose to transmit F_WRITE descriptors in both
>>> the Command and the Completion? Maybe I missed a reason why it's
>>> important.
>>
>> Just keep the flags same to the descriptor from the command, give the
>> initiator a hint 'this is a read descriptor'.
> 
> Sending virtqueue element information across the wire seems inefficient
> to me. I think the protocol can be optimized for stream (TCP) and keyed
> (RDMA) fabrics by omitting aspects that are not strictly needed.
> 
> Stefan

Got it, thanks! By the way, for both command and completion, the 
descriptors are not necessary? A command like:
struct virtio_of_command_vq {
         le16 opcode;
         le16 command_id;
         le32 out_length;
         le32 in_length;
         u8 rsvd[4];
};

This seems enough ...

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: RE: RE: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-05 13:12           ` Parav Pandit
@ 2023-06-06  7:13             ` zhenwei pi
  2023-06-06 21:52               ` Parav Pandit
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-06  7:13 UTC (permalink / raw)
  To: Parav Pandit, stefanha
  Cc: mst, virtio-comment, houp, helei.sig11, xinhao.kong, jasowang,
	zhouhuaping.san



On 6/5/23 21:12, Parav Pandit wrote:
> 
> 
>> From: zhenwei pi <pizhenwei@bytedance.com>
>> Sent: Monday, June 5, 2023 8:50 AM
> 
> 
>>>>> if we talk blk as an example, above command descriptor can be of 32
>>>>> bytes, such as struct virtio_of_cmd {
>>>>>        u8 opcode;
>>>>>        u8 rsvd;
>>>>>        le16 cmd_id;
>>>>>        u8 inline_desc_cnt;
>>>>>        u8 rsvd[3];
>>>>>        /* some padding/metadata for long desc list if any */ };
>>>>>
>>>>> struct virtio_of_rdma_desc {
>>>>>        le64 addr;
>>>>>        le32 length;
>>>>>        le32 rdma_key;
>>>>> };
>>>>>
>>>>> struct virtio_rdma_op {
>>>>>        struct virtio_of_cmd cmd;
>>>>>        struct virtio_of_rdma_desc desc[1 or 3]; /* count can be
>>>>> negotiated */ };
>>>>>
>>>>> With this a send and receive queue on initiator and target can
>>>>> exchange, cmd descriptor for read/writes.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> Do you mean that separating a Virtio Over RDMA queue into 2 QP, one
>>>> for sending, another one for receiving?
>>>>
>>> No. just one QP.
>>>
>>> Initiator_QP_A -> target_QP_B.
>>>
>>> When initiator QP A sends 32B cmd, it lands in the target QP B's receive
>> queue.
>>>
>>> After this target can do one or more read/write DMA using RDMA read/write
>> from the initiator's memory.
>>>
>>
>> Hi, I have several questions:
>> 1, how to tell the target to read/write DMA using RDMA read/write? is
>> virtio_of_rdma_desc missing?
>>
> Virtio_of_rdma_desc is part of the 32B struct virtio_rdma_op in above example.
> 
>> 2, if several virtio_of_rdma_desc arrives, the target need to distinguish READ *
>> m + WRITE * n descriptors. but *flags* field has been removed ...
>>
> The idea is to not have multiple virtio_of_rdma_desc.
> An initiator can represent 1B to 4GB of noncontiguous buffer using a single rdma mkey.
> Hence, only one virtio_of_rdma_desc is enough from initiator to target.
> 

I have only a few knowledge about this:
https://docs.nvidia.com/networking/pages/viewpage.action?pageId=25138119

And I notice that rdma-core and linux infiniband/core has no standard 
support for this, it seems mlx5 specific. Please correct me if I 
misunderstood...

>> 3, if I understand correctly, Initiator_QP_A -> Target_QP_B(CMD),
>> Target_QP_B(RDMA READ), Target_QP_B(RDMA WRITE), Target_QP_B ->
>> Initiator_QP_A(COMP). this uses 4 RTT.
>>
> RDMA read and writes are for the actual variable size data of 512B, 4K, 1MB etc.
> 
> Optionally, a target can expose a constant size buffer where initiator can directly write the data of 512B, 4KB as well.
> However, this doesn't scale very well always, but sure it is possible, and it only works for blk write commands.
> 
> In a more advanced scheme target can dynamically add such buffers and advertise it to the initiator.
> I would think to make it incremental once the basic data flow model is established.
> 
>>> Finally target_QP_B sends 8B completion, it arrives in the QP_A's receive
>> queue.
>>
>> --
>> zhenwei pi

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: Re: [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission
  2023-06-06  3:13         ` zhenwei pi
@ 2023-06-06 13:09           ` Stefan Hajnoczi
  0 siblings, 0 replies; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-06 13:09 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 3475 bytes --]

On Tue, Jun 06, 2023 at 11:13:01AM +0800, zhenwei pi wrote:
> On 6/6/23 00:11, Stefan Hajnoczi wrote:
> > On Fri, Jun 02, 2023 at 10:26:48AM +0800, zhenwei pi wrote:
> > > On 5/31/23 23:20, Stefan Hajnoczi wrote:
> > > > On Thu, May 04, 2023 at 04:19:03PM +0800, zhenwei pi wrote:
> > > > > +                  | +------+
> > > > > +                  | |flags |  -> VIRTIO_OF_DESC_F_WRITE
> > > > > +                  | +------+
> > > > > +                  |
> > > > > + DATA             |>+------+  -> 0
> > > > > +                    |......|
> > > > > +                    +------+  -> 1
> > > > > +\end{lstlisting}
> > > > 
> > > > I think this is more flexible (and has more protocol overhead) than
> > > > necessary. When the device has used a virtqueue buffer, it indicates how
> > > > many bytes were used (this can be less than the totaly number of F_WRITE
> > > > bytes available). I don't think there is a need to communicate F_WRITE
> > > > descriptors, especially in the Completion. Just a Completion with a
> > > > 'length' field instead of an 'ndesc' field followed by data is enough.
> > > > 
> > > 
> > > I guest this is not enough. For example, a initiator want to read 3 desc:
> > > desc0[n bytes], desc1[m bytes], desc2[1 byte]. desc[2] is expected to read a
> > > u8 status.
> > > 
> > > the target fills desc0[n - x bytes], desc1[m - y bytes], desc2[1 byte], the
> > > 'length' is (n - x + m - y + 1), we should decode each descriptor and fill
> > > the driver buffer correctly.(otherwise, if x + y > 0, desc[2] is never
> > > filled)
> > 
> > No, the framing really doesn't matter - that's what the spec says, after
> > all. The framing could be [n, m, 1] like in your example or [1, 1, n-2,
> > m-1, 1, 1], both are valid. What matters is that the device knows at
> > which offset the 1-byte status field must be written.
> > 
> > It is the VIRTIO specification that determines how to find the offset,
> > not the framing of the virtqueue buffer elements. (Again, the spec
> > explicitly forbids depending on framing.)
> > 
> > In other words, the virtio-blk spec says that the status byte is the
> > last writeable byte and that's how the device knows the offset. The
> > framing doesn't matter.
> > 
> > > > Since VIRTIO has flexible framing
> > > > (https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-390004),
> > > > there isn't really a need to communicate the F_WRITE descriptors at all,
> > > > just the maximum number of used bytes that the initiator expects.
> > > > 
> > > > Can you explain why you chose to transmit F_WRITE descriptors in both
> > > > the Command and the Completion? Maybe I missed a reason why it's
> > > > important.
> > > 
> > > Just keep the flags same to the descriptor from the command, give the
> > > initiator a hint 'this is a read descriptor'.
> > 
> > Sending virtqueue element information across the wire seems inefficient
> > to me. I think the protocol can be optimized for stream (TCP) and keyed
> > (RDMA) fabrics by omitting aspects that are not strictly needed.
> > 
> > Stefan
> 
> Got it, thanks! By the way, for both command and completion, the descriptors
> are not necessary? A command like:
> struct virtio_of_command_vq {
>         le16 opcode;
>         le16 command_id;
>         le32 out_length;
>         le32 in_length;
>         u8 rsvd[4];
> };
> 
> This seems enough ...

Yes.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-06  1:31         ` [virtio-comment] " zhenwei pi
@ 2023-06-06 13:34           ` Stefan Hajnoczi
  2023-06-07  2:58             ` [virtio-comment] " zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-06 13:34 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 9826 bytes --]

On Tue, Jun 06, 2023 at 09:31:27AM +0800, zhenwei pi wrote:
> On 6/6/23 00:30, Stefan Hajnoczi wrote:
> > On Fri, Jun 02, 2023 at 01:15:00PM +0800, zhenwei pi wrote:
> > > 
> > > 
> > > On 6/1/23 01:10, Stefan Hajnoczi wrote:
> > > > On Thu, May 04, 2023 at 04:19:05PM +0800, zhenwei pi wrote:
> > > > > Introduce command structures for Virtio-oF.
> > > > > 
> > > > > Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> > > > > ---
> > > > >    transport-fabrics.tex | 209 ++++++++++++++++++++++++++++++++++++++++++
> > > > >    1 file changed, 209 insertions(+)
> > > > > 
> > > > > diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> > > > > index 7711321..37f57c6 100644
> > > > > --- a/transport-fabrics.tex
> > > > > +++ b/transport-fabrics.tex
> > > > > @@ -495,3 +495,212 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
> > > > >                        |value |  -> 8193 (value.u32)
> > > > >                        +------+
> > > > >    \end{lstlisting}
> > > > > +
> > > > > +\subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
> > > > > +This section defines command structures for Virtio Over Fabrics.
> > > > > +
> > > > > +A common structure virtio_of_value is fixed to 8 bytes and MUST be used as one
> > > > > +of the following format:
> > > > > +
> > > > > +\begin{itemize}
> > > > > +\item u8
> > > > > +\item le16
> > > > > +\item le32
> > > > > +\item le64
> > > > > +\end{itemize}
> > > > 
> > > > The way it's written does not document where the u8, u16, u32 bytes are
> > > > located and that the unused bytes are 0. I think I understand what you
> > > > mean though:
> > > > 
> > > >     le64 value = cpu_to_le64((u64)v); /* v is u8, u16, u32, or u64 */
> > > > 
> > > > Please clarify.
> > > > 
> > > 
> > > I want to describe an union structure of 8 bytes:
> > > union virtio_of_value {
> > >      u8;
> > >      u16;
> > >      u32;
> > >      u64;
> > > };
> > > 
> > > Depending on the opcode, use the right one.
> > 
> > I was trying to point out that the memory layout of C unions is not
> > portable. Your example does not define the exact in-memory layout of
> > union virtio_of_value. Here is the first web search result I found about
> > this topic:
> > 
> >    "Q: And a related question: if you dump unions in binary form to a file,
> >    and then reload them from the file on a different platform, or with a
> >    program compiled by a different compiler, are you guaranteed to get
> >    back what you stored? (I think not, but I'm not sure)
> > 
> >    A: You're right; you're not."
> > 
> >    https://bytes.com/topic/c/answers/220372-unions-storage-abis
> > 
> > In the cpu_to_le64() code example that I gave, the exact in-memory
> > layout is well-defined. There is no ambiguity.
> > 
> 
> OK, thanks.
> 
> > > > > +\hline
> > > > > +0xff00 - 0xfffd & Reserved \\
> > > > > +\hline
> > > > > +\end{tabular}
> > > > > +
> > > > > +\paragraph{Connect Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
> > > > > +The Connect Command is used to establish Virtio Over Fabrics queue. The control
> > > > > +queue MUST be established firstly, then the Connect command establishes an
> > > > > +association between the initiator and the target.
> > > > 
> > > > Is a "Virtio Over Fabrics queue" different from a virtqueue?
> > > > 
> > > > If I understand correctly, the control queue must be established by the
> > > > initiator first and then the Connect command is sent to begin
> > > > communication between the initiator and the target?
> > > > 
> > > 
> > > The queue mapping is missing in the '[PATCH v2 01/11] transport-fabrics:
> > > introduce Virtio Over Fabrics overview', like:
> > > A "Virtio Over Fabrics queue" is a reliable connection between initiator and
> > > target. There are 2 types of Virtio Over Fabrics queue:
> > > +\begin{itemize}
> > > +\item A single Control queue is required to execute control operations.
> > > +\item 0 or more Virtio Over Fabrics queues map the virtqueues.
> > > +\end{itemize}
> > 
> > That helps, thanks!
> > 
> > > 
> > > > > +
> > > > > +The Target ID of 0xffff is reserved, then:
> > > > 
> > > > Please move this after the fields have been shown and the purpose of the
> > > > Target ID field has been explained.
> > > > 
> > > > > +\begin{itemize}
> > > > > +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
> > > > > +Command for the control queue.
> > > > > +\item The target SHOULD allocate any available Target ID to the initiator,
> > > > > +and return the allocated Target ID in the Completion.
> > > > > +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
> > > > > +MUST be specified in a Connect Command for the virtqueue.
> > > > > +\end{itemize}
> > > > 
> > > > What is the purpose of the Target ID? Is it to allow a server to provide
> > > > access to multiple targets over the same connection?
> > > > 
> > > 
> > > A target listens on a port, and provides access to 0 or more targets. An
> > > initiator connect the specific target by TVQN of connect command.
> > > An initiator could connect a single target, multiple initiators could
> > > connect the same target(typically, shared disk/fs).
> > 
> > Why is the target ID separate from the TVQN? If the Target ID is a
> > separate parameter then users will have to learn additional
> > syntax/command-line options to specify the TVQN + Target ID and that
> > syntax may vary between software.
> > 
> > > 
> > > > > +
> > > > > +The Connect Command has following structure:
> > > > > +
> > > > > +\begin{lstlisting}
> > > > > +struct virtio_of_command_connect {
> > > > > +        le16 opcode;
> > > > > +        le16 command_id;
> > > > > +        le16 target_id;
> > > > > +        le16 queue_id;
> > > > > +        le16 ndesc;
> > > > 
> > > > Where is this field documented?
> > > > 
> > > 
> > > OK. Will add.
> > > 
> > > > Why does the initiator send ndesc to the target? Normally a VIRTIO Transport reports the device's max descriptors and then the driver can tell the device to reduce the number of descriptors, if desired.
> > > > 
> > > 
> > > A target supports at lease 1 descriptor. The 'ndesc' of struct
> > > virtio_of_command_connect indicates the full PDU contains: struct
> > > virtio_of_command_connect + 1 * virtio_of_vq_desc + data.
> > > 
> > > > > +#define VIRTIO_OF_CONNECTION_TCP     1
> > > > > +#define VIRTIO_OF_CONNECTION_RDMA    2
> > > > 
> > > > What does RDMA mean? I thought RDMA is a general concept that several
> > > > fabrics implement (with different details like how addressing works).
> > > > 
> > > 
> > > I guest your concern is the difference of IB/RoCE/iWarp ...
> > > We are trying to define the payload protocol here, so I think we can ignore
> > > the difference of the HCA.
> > 
> > I see, maybe this could be called STREAM vs KEYED instead of TCP vs RDMA?
> > 
> 
> I'd like to define two PDU mapping rules(in '[PATCH v2 04/11]
> transport-fabrics: introduce Stream Transmission' and '[PATCH v2 05/11]
> transport-fabrics: introduce Keyed Transmission'): STREAM and KEYED. A
> transport protocols need to use one.
> 
> Then we can define protocols:
> #define VIRTIO_OF_CONNECTION_TCP     1		-> use STREAM
> #define VIRTIO_OF_CONNECTION_RDMA    2		-> use KEYED
> #define VIRTIO_OF_CONNECTION_TLS     3(in the future)	-> use STREAM
> #define VIRTIO_OF_CONNECTION_XXX

It's not clear to me whether TCP actually means TCP/IP or if it actually
means STREAM. For example, if I run Virtio Over Fabrics over AF_VSOCK,
would it use VIRTIO_OF_CONNECTION_TCP although there is no TCP/IP? If
so, then I think the name TCP is misleading and STREAM would be clearer.

> 
> > > 
> > > > > +        u8 oftype;
> > > > > +        u8 padding[5];
> > > > > +};
> > > > > +\end{lstlisting}
> > > > > +
> > > > > +The Connect commands MUST contains one Segment Descriptor and one structure
> > > > > +virtio_of_command_connect to specify Initiator VQN and Target VNQ,
> > > > > +virtio_of_command_connect has following structure:
> > > > 
> > > > I'm confsued. virtio_of_command_connect was defined above. The struct
> > > > defined below is virtio_of_connect. Does this paragraph need to be
> > > > updated (virtio_of_command_connect -> virtio_of_connect)?
> > > > 
> > > > Why is virtio_of_connect a separate struct and not part of
> > > > virtio_of_command_connect?
> > > > 
> > > 
> > > Because I'd like to define all the commands with a fixed length.
> > 
> > I don't understand. virtio_of_connect and virtio_of_command_connect are
> > both fixed-length. Why can't they be unified into 1 fixed-length struct?
> > 
> 
> For stream protocol, it always work fine.
> For keyed protocol, for example RDMA, the target side needs to use
> ibv_post_recv to receive a large size(sizeof virtio_of_command_connect +
> sizeof virtio_of_connect). If the target uses ibv_post_recv to receive
> sizeof(CMD) + sizeof(DESC) * 1, the initiator fails in RDMA SEND.

I read that "A RC connection is very similar to a TCP connection" in the
NVIDIA documentation
(https://docs.nvidia.com/networking/display/RDMAAwareProgrammingv17/Transport+Modes)
and expected SOCK_STREAM semantics for RDMA SEND.

Are you saying ibv_post_send() fails when the receiver's work request
sg_list size is smaller (fewer bytes) than the sender's?

Does the receiver have just 1 WR queued in your example? What happens if
the receiver queues multiple small WRs?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-06  2:02         ` [virtio-comment] " zhenwei pi
@ 2023-06-06 13:44           ` Stefan Hajnoczi
  2023-06-07  2:03             ` [virtio-comment] " zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-06 13:44 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 2875 bytes --]

On Tue, Jun 06, 2023 at 10:02:51AM +0800, zhenwei pi wrote:
> 
> 
> On 6/6/23 00:30, Stefan Hajnoczi wrote:
> [snip]
> > > 
> > > > > +
> > > > > +The Target ID of 0xffff is reserved, then:
> > > > 
> > > > Please move this after the fields have been shown and the purpose of the
> > > > Target ID field has been explained.
> > > > 
> > > > > +\begin{itemize}
> > > > > +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
> > > > > +Command for the control queue.
> > > > > +\item The target SHOULD allocate any available Target ID to the initiator,
> > > > > +and return the allocated Target ID in the Completion.
> > > > > +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
> > > > > +MUST be specified in a Connect Command for the virtqueue.
> > > > > +\end{itemize}
> > > > 
> > > > What is the purpose of the Target ID? Is it to allow a server to provide
> > > > access to multiple targets over the same connection?
> > > > 
> > > 
> > > A target listens on a port, and provides access to 0 or more targets. An
> > > initiator connect the specific target by TVQN of connect command.
> > > An initiator could connect a single target, multiple initiators could
> > > connect the same target(typically, shared disk/fs).
> > 
> > Why is the target ID separate from the TVQN? If the Target ID is a
> > separate parameter then users will have to learn additional
> > syntax/command-line options to specify the TVQN + Target ID and that
> > syntax may vary between software.
> > 
> 
> The TVQN is the location of a target, for example:
> virtio-tcp://192.168.1.110:8549/blk-dev/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
> 
> A target can be shared by multi initiators, they accesses the target by the
> same address(transport: tcp, ip: 192.168.1.110, port: 8549, TVQN:
> blk-dev/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1):
> 
> Initiator_A launches Control Queue, issues connect command, gets the Target
> ID of Target_A(typically, dynamically allocated by target), then virtqueues
> connect to Target_A.
> 
> Initiator_B launches Control Queue, issues connect command, gets the Target
> ID of Target_B(typically, dynamically allocated by target), then virtqueues
> connect to Target_B.

In your example you say "A target can be shared by multi initiators" but
then say "Target_A" and "Target_B", so the two initiators are not really
communicating with the same target?

Maybe instead of Target ID it should be called Device Instance ID? Then
the "Target" is the server that listens on 192.168.1.110:8549 and the
"Device Instance" is the VIRTIO device that the initiator is accessing
through the Target. One Target may contain many Device Instances.

I think that's clearer than calling boths Targets and Device Instances
the same thing.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] Re: Re: Re: [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding
  2023-06-06  1:41         ` [virtio-comment] " zhenwei pi
@ 2023-06-06 13:51           ` Stefan Hajnoczi
  2023-06-07  2:15             ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-06 13:51 UTC (permalink / raw)
  To: zhenwei pi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 3919 bytes --]

On Tue, Jun 06, 2023 at 09:41:09AM +0800, zhenwei pi wrote:
> On 6/6/23 00:57, Stefan Hajnoczi wrote:
> > On Fri, Jun 02, 2023 at 05:07:14PM +0800, zhenwei pi wrote:
> > > 
> > > 
> > > On 6/1/23 05:02, Stefan Hajnoczi wrote:
> > > > On Thu, May 04, 2023 at 04:19:08PM +0800, zhenwei pi wrote:
> > > > > Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> > > > > ---
> > > > >    transport-fabrics.tex | 9 +++++++++
> > > > >    1 file changed, 9 insertions(+)
> > > > > 
> > > > > diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> > > > > index f563c3e..c47a744 100644
> > > > > --- a/transport-fabrics.tex
> > > > > +++ b/transport-fabrics.tex
> > > > > @@ -873,3 +873,12 @@ \subsubsection{Status Definition}\label{sec:Virtio Transport Options / Virtio Ov
> > > > >    #define VIRTIO_OF_EALREADY      114
> > > > >    #define VIRTIO_OF_EQUIRK        4096
> > > > >    \end{lstlisting}
> > > > > +
> > > > > +\subsection{Transport Binding}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transport Binding}
> > > > > +\subsubsection{TCP}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / TCP}
> > > > > +TCP MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
> > > > > +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}.
> > > > > +
> > > > > +\subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / RDMA}
> > > > > +RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
> > > > > +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}.
> > > > 
> > > > What about VQN representation, default port numbers, etc? There should
> > > > be enough information here so implementers can create compatible
> > > > implementations.
> > > > 
> > > 
> > > Already replied in '[PATCH v2 02/11] transport-fabrics: introduce Virtio
> > > Qualified Name'.
> > > 
> > > > Is there connection encryption support? It's hard to imagine running a
> > > > plaintext Virtio Over Fabrics TCP connection in a production environment
> > > > due to security concerns.
> > > > 
> > > > Stefan
> > > 
> > > As far as I can see, 1) an ACL mechanism could be used in the engineering
> > > implementation without any specification.(ex, a target only allows a
> > > specific IVQN). 2) authentication may be introduced in the future.
> > > 
> > > Does the virtqueue buffers need encryption support?
> > 
> > An ACL in the target is still susceptible to attacks on confidentiality
> > (spying on traffic) and integrity (spoofing, injecting, or corrupting
> > traffic).
> > 
> > My view is that nowadays anything that goes over the network needs
> > Transport Layer Security (TLS) built in or something comparable unless
> > the use cases are clearly limited to scenarios where this is not
> > necessary. To me it seems like Virtio over Fabarics could be used in
> > scenarios where encryption is necessary (e.g. to protect user data being
> > sent over a network).
> > 
> > NVMe-over-TCP supports TLS.
> > 
> > Stefan
> 
> Generally, LAN is considered to be secure, using TCP makes sense. TLS is
> needed for WAN.

This depends on the security policy of the organization. I don't know
what percentage of organizations trust internal networks, but I'm sure
there is a significant proportion of organizations nowadays where
deploying an unsecured network service is not allowed.

Also, Virtio Over Fabrics (TCP) will work over the internet and some
users may use it for that.

I think including optional TLS support from the beginning is necessary.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: RE: RE: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
  2023-06-06  7:13             ` zhenwei pi
@ 2023-06-06 21:52               ` Parav Pandit
  0 siblings, 0 replies; 74+ messages in thread
From: Parav Pandit @ 2023-06-06 21:52 UTC (permalink / raw)
  To: zhenwei pi, stefanha
  Cc: mst, virtio-comment, houp, helei.sig11, xinhao.kong, jasowang,
	zhouhuaping.san



> From: zhenwei pi <pizhenwei@bytedance.com>
> Sent: Tuesday, June 6, 2023 3:14 AM
> > The idea is to not have multiple virtio_of_rdma_desc.
> > An initiator can represent 1B to 4GB of noncontiguous buffer using a single
> rdma mkey.
> > Hence, only one virtio_of_rdma_desc is enough from initiator to target.
> >
> 
> I have only a few knowledge about this:
> https://docs.nvidia.com/networking/pages/viewpage.action?pageId=25138119
> 
> And I notice that rdma-core and linux infiniband/core has no standard support
> for this, it seems mlx5 specific. Please correct me if I misunderstood...
> 
It is part of the IB specification and infiniband/core for a long time probably for a decade now.
And multiple hw vendors support it.
It is called as fast register memory regions WQE.
Don’t have the link handy at the moment.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-06 13:44           ` Stefan Hajnoczi
@ 2023-06-07  2:03             ` zhenwei pi
  0 siblings, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-06-07  2:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/6/23 21:44, Stefan Hajnoczi wrote:
> On Tue, Jun 06, 2023 at 10:02:51AM +0800, zhenwei pi wrote:
>>
>>
>> On 6/6/23 00:30, Stefan Hajnoczi wrote:
>> [snip]
>>>>
>>>>>> +
>>>>>> +The Target ID of 0xffff is reserved, then:
>>>>>
>>>>> Please move this after the fields have been shown and the purpose of the
>>>>> Target ID field has been explained.
>>>>>
>>>>>> +\begin{itemize}
>>>>>> +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
>>>>>> +Command for the control queue.
>>>>>> +\item The target SHOULD allocate any available Target ID to the initiator,
>>>>>> +and return the allocated Target ID in the Completion.
>>>>>> +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
>>>>>> +MUST be specified in a Connect Command for the virtqueue.
>>>>>> +\end{itemize}
>>>>>
>>>>> What is the purpose of the Target ID? Is it to allow a server to provide
>>>>> access to multiple targets over the same connection?
>>>>>
>>>>
>>>> A target listens on a port, and provides access to 0 or more targets. An
>>>> initiator connect the specific target by TVQN of connect command.
>>>> An initiator could connect a single target, multiple initiators could
>>>> connect the same target(typically, shared disk/fs).
>>>
>>> Why is the target ID separate from the TVQN? If the Target ID is a
>>> separate parameter then users will have to learn additional
>>> syntax/command-line options to specify the TVQN + Target ID and that
>>> syntax may vary between software.
>>>
>>
>> The TVQN is the location of a target, for example:
>> virtio-tcp://192.168.1.110:8549/blk-dev/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1
>>
>> A target can be shared by multi initiators, they accesses the target by the
>> same address(transport: tcp, ip: 192.168.1.110, port: 8549, TVQN:
>> blk-dev/hdd-pool/238151a7-acd7-4621-bbdf-382ddbccb6a1):
>>
>> Initiator_A launches Control Queue, issues connect command, gets the Target
>> ID of Target_A(typically, dynamically allocated by target), then virtqueues
>> connect to Target_A.
>>
>> Initiator_B launches Control Queue, issues connect command, gets the Target
>> ID of Target_B(typically, dynamically allocated by target), then virtqueues
>> connect to Target_B.
> 
> In your example you say "A target can be shared by multi initiators" but
> then say "Target_A" and "Target_B", so the two initiators are not really
> communicating with the same target?
> 

Yes, the two initiators are actually communicating with the same target 
of two Device Instances.

> Maybe instead of Target ID it should be called Device Instance ID? Then
> the "Target" is the server that listens on 192.168.1.110:8549 and the
> "Device Instance" is the VIRTIO device that the initiator is accessing
> through the Target. One Target may contain many Device Instances.
> 
> I think that's clearer than calling boths Targets and Device Instances
> the same thing.
> 
> Stefan

OK, this seems better! Fix this in next version.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Re: [virtio-comment] Re: Re: Re: [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding
  2023-06-06 13:51           ` Stefan Hajnoczi
@ 2023-06-07  2:15             ` zhenwei pi
  0 siblings, 0 replies; 74+ messages in thread
From: zhenwei pi @ 2023-06-07  2:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

On 6/6/23 21:51, Stefan Hajnoczi wrote:
> On Tue, Jun 06, 2023 at 09:41:09AM +0800, zhenwei pi wrote:
>> On 6/6/23 00:57, Stefan Hajnoczi wrote:
>>> On Fri, Jun 02, 2023 at 05:07:14PM +0800, zhenwei pi wrote:
>>>>
>>>>
>>>> On 6/1/23 05:02, Stefan Hajnoczi wrote:
>>>>> On Thu, May 04, 2023 at 04:19:08PM +0800, zhenwei pi wrote:
>>>>>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>>>>>> ---
>>>>>>     transport-fabrics.tex | 9 +++++++++
>>>>>>     1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>>>>>> index f563c3e..c47a744 100644
>>>>>> --- a/transport-fabrics.tex
>>>>>> +++ b/transport-fabrics.tex
>>>>>> @@ -873,3 +873,12 @@ \subsubsection{Status Definition}\label{sec:Virtio Transport Options / Virtio Ov
>>>>>>     #define VIRTIO_OF_EALREADY      114
>>>>>>     #define VIRTIO_OF_EQUIRK        4096
>>>>>>     \end{lstlisting}
>>>>>> +
>>>>>> +\subsection{Transport Binding}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transport Binding}
>>>>>> +\subsubsection{TCP}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / TCP}
>>>>>> +TCP MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}
>>>>>> +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Stream Transmission}.
>>>>>> +
>>>>>> +\subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over Fabrics / ransport Binding / RDMA}
>>>>>> +RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}
>>>>>> +~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Keyed Transmission}.
>>>>>
>>>>> What about VQN representation, default port numbers, etc? There should
>>>>> be enough information here so implementers can create compatible
>>>>> implementations.
>>>>>
>>>>
>>>> Already replied in '[PATCH v2 02/11] transport-fabrics: introduce Virtio
>>>> Qualified Name'.
>>>>
>>>>> Is there connection encryption support? It's hard to imagine running a
>>>>> plaintext Virtio Over Fabrics TCP connection in a production environment
>>>>> due to security concerns.
>>>>>
>>>>> Stefan
>>>>
>>>> As far as I can see, 1) an ACL mechanism could be used in the engineering
>>>> implementation without any specification.(ex, a target only allows a
>>>> specific IVQN). 2) authentication may be introduced in the future.
>>>>
>>>> Does the virtqueue buffers need encryption support?
>>>
>>> An ACL in the target is still susceptible to attacks on confidentiality
>>> (spying on traffic) and integrity (spoofing, injecting, or corrupting
>>> traffic).
>>>
>>> My view is that nowadays anything that goes over the network needs
>>> Transport Layer Security (TLS) built in or something comparable unless
>>> the use cases are clearly limited to scenarios where this is not
>>> necessary. To me it seems like Virtio over Fabarics could be used in
>>> scenarios where encryption is necessary (e.g. to protect user data being
>>> sent over a network).
>>>
>>> NVMe-over-TCP supports TLS.
>>>
>>> Stefan
>>
>> Generally, LAN is considered to be secure, using TCP makes sense. TLS is
>> needed for WAN.
> 
> This depends on the security policy of the organization. I don't know
> what percentage of organizations trust internal networks, but I'm sure
> there is a significant proportion of organizations nowadays where
> deploying an unsecured network service is not allowed.
> 
> Also, Virtio Over Fabrics (TCP) will work over the internet and some
> users may use it for that.
> 
> I think including optional TLS support from the beginning is necessary.
> 
> Stefan

Agree with the optional TLS support. Let's continue the detail 
discussion in '[PATCH v2 06/11] transport-fabrics: introduce command set'.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-06 13:34           ` Stefan Hajnoczi
@ 2023-06-07  2:58             ` zhenwei pi
  2023-06-08 16:41               ` Stefan Hajnoczi
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-07  2:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: parav, mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/6/23 21:34, Stefan Hajnoczi wrote:
> On Tue, Jun 06, 2023 at 09:31:27AM +0800, zhenwei pi wrote:
>> On 6/6/23 00:30, Stefan Hajnoczi wrote:
>>> On Fri, Jun 02, 2023 at 01:15:00PM +0800, zhenwei pi wrote:
>>>>
>>>>
>>>> On 6/1/23 01:10, Stefan Hajnoczi wrote:
>>>>> On Thu, May 04, 2023 at 04:19:05PM +0800, zhenwei pi wrote:
>>>>>> Introduce command structures for Virtio-oF.
>>>>>>
>>>>>> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
>>>>>> ---
>>>>>>     transport-fabrics.tex | 209 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 209 insertions(+)
>>>>>>
>>>>>> diff --git a/transport-fabrics.tex b/transport-fabrics.tex
>>>>>> index 7711321..37f57c6 100644
>>>>>> --- a/transport-fabrics.tex
>>>>>> +++ b/transport-fabrics.tex
>>>>>> @@ -495,3 +495,212 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
>>>>>>                         |value |  -> 8193 (value.u32)
>>>>>>                         +------+
>>>>>>     \end{lstlisting}
>>>>>> +
>>>>>> +\subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
>>>>>> +This section defines command structures for Virtio Over Fabrics.
>>>>>> +
>>>>>> +A common structure virtio_of_value is fixed to 8 bytes and MUST be used as one
>>>>>> +of the following format:
>>>>>> +
>>>>>> +\begin{itemize}
>>>>>> +\item u8
>>>>>> +\item le16
>>>>>> +\item le32
>>>>>> +\item le64
>>>>>> +\end{itemize}
>>>>>
>>>>> The way it's written does not document where the u8, u16, u32 bytes are
>>>>> located and that the unused bytes are 0. I think I understand what you
>>>>> mean though:
>>>>>
>>>>>      le64 value = cpu_to_le64((u64)v); /* v is u8, u16, u32, or u64 */
>>>>>
>>>>> Please clarify.
>>>>>
>>>>
>>>> I want to describe an union structure of 8 bytes:
>>>> union virtio_of_value {
>>>>       u8;
>>>>       u16;
>>>>       u32;
>>>>       u64;
>>>> };
>>>>
>>>> Depending on the opcode, use the right one.
>>>
>>> I was trying to point out that the memory layout of C unions is not
>>> portable. Your example does not define the exact in-memory layout of
>>> union virtio_of_value. Here is the first web search result I found about
>>> this topic:
>>>
>>>     "Q: And a related question: if you dump unions in binary form to a file,
>>>     and then reload them from the file on a different platform, or with a
>>>     program compiled by a different compiler, are you guaranteed to get
>>>     back what you stored? (I think not, but I'm not sure)
>>>
>>>     A: You're right; you're not."
>>>
>>>     https://bytes.com/topic/c/answers/220372-unions-storage-abis
>>>
>>> In the cpu_to_le64() code example that I gave, the exact in-memory
>>> layout is well-defined. There is no ambiguity.
>>>
>>
>> OK, thanks.
>>
>>>>>> +\hline
>>>>>> +0xff00 - 0xfffd & Reserved \\
>>>>>> +\hline
>>>>>> +\end{tabular}
>>>>>> +
>>>>>> +\paragraph{Connect Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
>>>>>> +The Connect Command is used to establish Virtio Over Fabrics queue. The control
>>>>>> +queue MUST be established firstly, then the Connect command establishes an
>>>>>> +association between the initiator and the target.
>>>>>
>>>>> Is a "Virtio Over Fabrics queue" different from a virtqueue?
>>>>>
>>>>> If I understand correctly, the control queue must be established by the
>>>>> initiator first and then the Connect command is sent to begin
>>>>> communication between the initiator and the target?
>>>>>
>>>>
>>>> The queue mapping is missing in the '[PATCH v2 01/11] transport-fabrics:
>>>> introduce Virtio Over Fabrics overview', like:
>>>> A "Virtio Over Fabrics queue" is a reliable connection between initiator and
>>>> target. There are 2 types of Virtio Over Fabrics queue:
>>>> +\begin{itemize}
>>>> +\item A single Control queue is required to execute control operations.
>>>> +\item 0 or more Virtio Over Fabrics queues map the virtqueues.
>>>> +\end{itemize}
>>>
>>> That helps, thanks!
>>>
>>>>
>>>>>> +
>>>>>> +The Target ID of 0xffff is reserved, then:
>>>>>
>>>>> Please move this after the fields have been shown and the purpose of the
>>>>> Target ID field has been explained.
>>>>>
>>>>>> +\begin{itemize}
>>>>>> +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
>>>>>> +Command for the control queue.
>>>>>> +\item The target SHOULD allocate any available Target ID to the initiator,
>>>>>> +and return the allocated Target ID in the Completion.
>>>>>> +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
>>>>>> +MUST be specified in a Connect Command for the virtqueue.
>>>>>> +\end{itemize}
>>>>>
>>>>> What is the purpose of the Target ID? Is it to allow a server to provide
>>>>> access to multiple targets over the same connection?
>>>>>
>>>>
>>>> A target listens on a port, and provides access to 0 or more targets. An
>>>> initiator connect the specific target by TVQN of connect command.
>>>> An initiator could connect a single target, multiple initiators could
>>>> connect the same target(typically, shared disk/fs).
>>>
>>> Why is the target ID separate from the TVQN? If the Target ID is a
>>> separate parameter then users will have to learn additional
>>> syntax/command-line options to specify the TVQN + Target ID and that
>>> syntax may vary between software.
>>>
>>>>
>>>>>> +
>>>>>> +The Connect Command has following structure:
>>>>>> +
>>>>>> +\begin{lstlisting}
>>>>>> +struct virtio_of_command_connect {
>>>>>> +        le16 opcode;
>>>>>> +        le16 command_id;
>>>>>> +        le16 target_id;
>>>>>> +        le16 queue_id;
>>>>>> +        le16 ndesc;
>>>>>
>>>>> Where is this field documented?
>>>>>
>>>>
>>>> OK. Will add.
>>>>
>>>>> Why does the initiator send ndesc to the target? Normally a VIRTIO Transport reports the device's max descriptors and then the driver can tell the device to reduce the number of descriptors, if desired.
>>>>>
>>>>
>>>> A target supports at lease 1 descriptor. The 'ndesc' of struct
>>>> virtio_of_command_connect indicates the full PDU contains: struct
>>>> virtio_of_command_connect + 1 * virtio_of_vq_desc + data.
>>>>
>>>>>> +#define VIRTIO_OF_CONNECTION_TCP     1
>>>>>> +#define VIRTIO_OF_CONNECTION_RDMA    2
>>>>>
>>>>> What does RDMA mean? I thought RDMA is a general concept that several
>>>>> fabrics implement (with different details like how addressing works).
>>>>>
>>>>
>>>> I guest your concern is the difference of IB/RoCE/iWarp ...
>>>> We are trying to define the payload protocol here, so I think we can ignore
>>>> the difference of the HCA.
>>>
>>> I see, maybe this could be called STREAM vs KEYED instead of TCP vs RDMA?
>>>
>>
>> I'd like to define two PDU mapping rules(in '[PATCH v2 04/11]
>> transport-fabrics: introduce Stream Transmission' and '[PATCH v2 05/11]
>> transport-fabrics: introduce Keyed Transmission'): STREAM and KEYED. A
>> transport protocols need to use one.
>>
>> Then we can define protocols:
>> #define VIRTIO_OF_CONNECTION_TCP     1		-> use STREAM
>> #define VIRTIO_OF_CONNECTION_RDMA    2		-> use KEYED
>> #define VIRTIO_OF_CONNECTION_TLS     3(in the future)	-> use STREAM
>> #define VIRTIO_OF_CONNECTION_XXX
> 
> It's not clear to me whether TCP actually means TCP/IP or if it actually
> means STREAM. For example, if I run Virtio Over Fabrics over AF_VSOCK,
> would it use VIRTIO_OF_CONNECTION_TCP although there is no TCP/IP? If
> so, then I think the name TCP is misleading and STREAM would be clearer.
> 

What about dropping 'oftype' field from this command? When the command 
is allowed to issue, the reliable connection is already established, at 
this point, we have enough information about the connection type.

Instead, we define the multiple transports in the following section, like:
\subsection{Transport Binding}\label{sec:Virtio Transport Options / 
Virtio Over Fabrics / Transport Binding}
\subsubsection{TCP/IP}\label{sec:Virtio Transport Options / Virtio Over 
Fabrics / Transport Binding / TCP_IP}
TCP/IP supports both IPv4 and IPv6, it uses \ref{sec:Virtio Transport 
Options / Virtio Over Fabrics / Transmission Protocol / Commands 
Definition / Stream Transmission}
~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / 
Transmission Protocol / Commands Definition / Stream Transmission} ...

\subsubsection{TLS-TCP/IP}\label{sec:Virtio Transport Options / Virtio 
Over Fabrics / Transport Binding / TLS-TCP_IP}
TLS-TCP/IP supports both IPv4 and IPv6 ...

\subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over 
Fabrics / Transport Binding / RDMA}
RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics / 
Transmission Protocol / Commands Definition / Keyed Transmission}
~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / 
Transmission Protocol / Commands Definition / Keyed Transmission} ...

[\subsubsection{TCP/VSOCK}\label{sec:Virtio Transport Options / Virtio 
Over Fabrics / ransport Binding / TCP_VSOCK} ...]

>>
>>>>
>>>>>> +        u8 oftype;
>>>>>> +        u8 padding[5];
>>>>>> +};
>>>>>> +\end{lstlisting}
>>>>>> +
>>>>>> +The Connect commands MUST contains one Segment Descriptor and one structure
>>>>>> +virtio_of_command_connect to specify Initiator VQN and Target VNQ,
>>>>>> +virtio_of_command_connect has following structure:
>>>>>
>>>>> I'm confsued. virtio_of_command_connect was defined above. The struct
>>>>> defined below is virtio_of_connect. Does this paragraph need to be
>>>>> updated (virtio_of_command_connect -> virtio_of_connect)?
>>>>>
>>>>> Why is virtio_of_connect a separate struct and not part of
>>>>> virtio_of_command_connect?
>>>>>
>>>>
>>>> Because I'd like to define all the commands with a fixed length.
>>>
>>> I don't understand. virtio_of_connect and virtio_of_command_connect are
>>> both fixed-length. Why can't they be unified into 1 fixed-length struct?
>>>
>>
>> For stream protocol, it always work fine.
>> For keyed protocol, for example RDMA, the target side needs to use
>> ibv_post_recv to receive a large size(sizeof virtio_of_command_connect +
>> sizeof virtio_of_connect). If the target uses ibv_post_recv to receive
>> sizeof(CMD) + sizeof(DESC) * 1, the initiator fails in RDMA SEND.
> 
> I read that "A RC connection is very similar to a TCP connection" in the
> NVIDIA documentation
> (https://docs.nvidia.com/networking/display/RDMAAwareProgrammingv17/Transport+Modes)
> and expected SOCK_STREAM semantics for RDMA SEND.
> 
> Are you saying ibv_post_send() fails when the receiver's work request
> sg_list size is smaller (fewer bytes) than the sender's?
> 

Yes, it will fail.
The receiver get a CQE with status 'IBV_WC_LOC_LEN_ERR', see
https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/

> Does the receiver have just 1 WR queued in your example? 

No, the receiver need queue WRs of 'depth', generally the 'depth' is the 
virtqueue size.

What happens if
> the receiver queues multiple small WRs?
> 
> Stefan

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-07  2:58             ` [virtio-comment] " zhenwei pi
@ 2023-06-08 16:41               ` Stefan Hajnoczi
  2023-06-08 17:01                 ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2023-06-08 16:41 UTC (permalink / raw)
  To: zhenwei pi, parav
  Cc: mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

[-- Attachment #1: Type: text/plain, Size: 12585 bytes --]

On Wed, Jun 07, 2023 at 10:58:45AM +0800, zhenwei pi wrote:
> On 6/6/23 21:34, Stefan Hajnoczi wrote:
> > On Tue, Jun 06, 2023 at 09:31:27AM +0800, zhenwei pi wrote:
> > > On 6/6/23 00:30, Stefan Hajnoczi wrote:
> > > > On Fri, Jun 02, 2023 at 01:15:00PM +0800, zhenwei pi wrote:
> > > > > 
> > > > > 
> > > > > On 6/1/23 01:10, Stefan Hajnoczi wrote:
> > > > > > On Thu, May 04, 2023 at 04:19:05PM +0800, zhenwei pi wrote:
> > > > > > > Introduce command structures for Virtio-oF.
> > > > > > > 
> > > > > > > Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
> > > > > > > ---
> > > > > > >     transport-fabrics.tex | 209 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > >     1 file changed, 209 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/transport-fabrics.tex b/transport-fabrics.tex
> > > > > > > index 7711321..37f57c6 100644
> > > > > > > --- a/transport-fabrics.tex
> > > > > > > +++ b/transport-fabrics.tex
> > > > > > > @@ -495,3 +495,212 @@ \subsubsection{Buffer Mapping Definition}\label{sec:Virtio Transport Options / V
> > > > > > >                         |value |  -> 8193 (value.u32)
> > > > > > >                         +------+
> > > > > > >     \end{lstlisting}
> > > > > > > +
> > > > > > > +\subsubsection{Commands Definition}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition}
> > > > > > > +This section defines command structures for Virtio Over Fabrics.
> > > > > > > +
> > > > > > > +A common structure virtio_of_value is fixed to 8 bytes and MUST be used as one
> > > > > > > +of the following format:
> > > > > > > +
> > > > > > > +\begin{itemize}
> > > > > > > +\item u8
> > > > > > > +\item le16
> > > > > > > +\item le32
> > > > > > > +\item le64
> > > > > > > +\end{itemize}
> > > > > > 
> > > > > > The way it's written does not document where the u8, u16, u32 bytes are
> > > > > > located and that the unused bytes are 0. I think I understand what you
> > > > > > mean though:
> > > > > > 
> > > > > >      le64 value = cpu_to_le64((u64)v); /* v is u8, u16, u32, or u64 */
> > > > > > 
> > > > > > Please clarify.
> > > > > > 
> > > > > 
> > > > > I want to describe an union structure of 8 bytes:
> > > > > union virtio_of_value {
> > > > >       u8;
> > > > >       u16;
> > > > >       u32;
> > > > >       u64;
> > > > > };
> > > > > 
> > > > > Depending on the opcode, use the right one.
> > > > 
> > > > I was trying to point out that the memory layout of C unions is not
> > > > portable. Your example does not define the exact in-memory layout of
> > > > union virtio_of_value. Here is the first web search result I found about
> > > > this topic:
> > > > 
> > > >     "Q: And a related question: if you dump unions in binary form to a file,
> > > >     and then reload them from the file on a different platform, or with a
> > > >     program compiled by a different compiler, are you guaranteed to get
> > > >     back what you stored? (I think not, but I'm not sure)
> > > > 
> > > >     A: You're right; you're not."
> > > > 
> > > >     https://bytes.com/topic/c/answers/220372-unions-storage-abis
> > > > 
> > > > In the cpu_to_le64() code example that I gave, the exact in-memory
> > > > layout is well-defined. There is no ambiguity.
> > > > 
> > > 
> > > OK, thanks.
> > > 
> > > > > > > +\hline
> > > > > > > +0xff00 - 0xfffd & Reserved \\
> > > > > > > +\hline
> > > > > > > +\end{tabular}
> > > > > > > +
> > > > > > > +\paragraph{Connect Command}\label{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition / Connect Command}
> > > > > > > +The Connect Command is used to establish Virtio Over Fabrics queue. The control
> > > > > > > +queue MUST be established firstly, then the Connect command establishes an
> > > > > > > +association between the initiator and the target.
> > > > > > 
> > > > > > Is a "Virtio Over Fabrics queue" different from a virtqueue?
> > > > > > 
> > > > > > If I understand correctly, the control queue must be established by the
> > > > > > initiator first and then the Connect command is sent to begin
> > > > > > communication between the initiator and the target?
> > > > > > 
> > > > > 
> > > > > The queue mapping is missing in the '[PATCH v2 01/11] transport-fabrics:
> > > > > introduce Virtio Over Fabrics overview', like:
> > > > > A "Virtio Over Fabrics queue" is a reliable connection between initiator and
> > > > > target. There are 2 types of Virtio Over Fabrics queue:
> > > > > +\begin{itemize}
> > > > > +\item A single Control queue is required to execute control operations.
> > > > > +\item 0 or more Virtio Over Fabrics queues map the virtqueues.
> > > > > +\end{itemize}
> > > > 
> > > > That helps, thanks!
> > > > 
> > > > > 
> > > > > > > +
> > > > > > > +The Target ID of 0xffff is reserved, then:
> > > > > > 
> > > > > > Please move this after the fields have been shown and the purpose of the
> > > > > > Target ID field has been explained.
> > > > > > 
> > > > > > > +\begin{itemize}
> > > > > > > +\item The Target ID of 0xffff MUST be specified as the Target ID in a Connect
> > > > > > > +Command for the control queue.
> > > > > > > +\item The target SHOULD allocate any available Target ID to the initiator,
> > > > > > > +and return the allocated Target ID in the Completion.
> > > > > > > +\item The returned Target ID MUST be specified as the Target ID, and the Queue ID
> > > > > > > +MUST be specified in a Connect Command for the virtqueue.
> > > > > > > +\end{itemize}
> > > > > > 
> > > > > > What is the purpose of the Target ID? Is it to allow a server to provide
> > > > > > access to multiple targets over the same connection?
> > > > > > 
> > > > > 
> > > > > A target listens on a port, and provides access to 0 or more targets. An
> > > > > initiator connect the specific target by TVQN of connect command.
> > > > > An initiator could connect a single target, multiple initiators could
> > > > > connect the same target(typically, shared disk/fs).
> > > > 
> > > > Why is the target ID separate from the TVQN? If the Target ID is a
> > > > separate parameter then users will have to learn additional
> > > > syntax/command-line options to specify the TVQN + Target ID and that
> > > > syntax may vary between software.
> > > > 
> > > > > 
> > > > > > > +
> > > > > > > +The Connect Command has following structure:
> > > > > > > +
> > > > > > > +\begin{lstlisting}
> > > > > > > +struct virtio_of_command_connect {
> > > > > > > +        le16 opcode;
> > > > > > > +        le16 command_id;
> > > > > > > +        le16 target_id;
> > > > > > > +        le16 queue_id;
> > > > > > > +        le16 ndesc;
> > > > > > 
> > > > > > Where is this field documented?
> > > > > > 
> > > > > 
> > > > > OK. Will add.
> > > > > 
> > > > > > Why does the initiator send ndesc to the target? Normally a VIRTIO Transport reports the device's max descriptors and then the driver can tell the device to reduce the number of descriptors, if desired.
> > > > > > 
> > > > > 
> > > > > A target supports at lease 1 descriptor. The 'ndesc' of struct
> > > > > virtio_of_command_connect indicates the full PDU contains: struct
> > > > > virtio_of_command_connect + 1 * virtio_of_vq_desc + data.
> > > > > 
> > > > > > > +#define VIRTIO_OF_CONNECTION_TCP     1
> > > > > > > +#define VIRTIO_OF_CONNECTION_RDMA    2
> > > > > > 
> > > > > > What does RDMA mean? I thought RDMA is a general concept that several
> > > > > > fabrics implement (with different details like how addressing works).
> > > > > > 
> > > > > 
> > > > > I guest your concern is the difference of IB/RoCE/iWarp ...
> > > > > We are trying to define the payload protocol here, so I think we can ignore
> > > > > the difference of the HCA.
> > > > 
> > > > I see, maybe this could be called STREAM vs KEYED instead of TCP vs RDMA?
> > > > 
> > > 
> > > I'd like to define two PDU mapping rules(in '[PATCH v2 04/11]
> > > transport-fabrics: introduce Stream Transmission' and '[PATCH v2 05/11]
> > > transport-fabrics: introduce Keyed Transmission'): STREAM and KEYED. A
> > > transport protocols need to use one.
> > > 
> > > Then we can define protocols:
> > > #define VIRTIO_OF_CONNECTION_TCP     1		-> use STREAM
> > > #define VIRTIO_OF_CONNECTION_RDMA    2		-> use KEYED
> > > #define VIRTIO_OF_CONNECTION_TLS     3(in the future)	-> use STREAM
> > > #define VIRTIO_OF_CONNECTION_XXX
> > 
> > It's not clear to me whether TCP actually means TCP/IP or if it actually
> > means STREAM. For example, if I run Virtio Over Fabrics over AF_VSOCK,
> > would it use VIRTIO_OF_CONNECTION_TCP although there is no TCP/IP? If
> > so, then I think the name TCP is misleading and STREAM would be clearer.
> > 
> 
> What about dropping 'oftype' field from this command? When the command is
> allowed to issue, the reliable connection is already established, at this
> point, we have enough information about the connection type.
> 
> Instead, we define the multiple transports in the following section, like:
> \subsection{Transport Binding}\label{sec:Virtio Transport Options / Virtio
> Over Fabrics / Transport Binding}
> \subsubsection{TCP/IP}\label{sec:Virtio Transport Options / Virtio Over
> Fabrics / Transport Binding / TCP_IP}
> TCP/IP supports both IPv4 and IPv6, it uses \ref{sec:Virtio Transport
> Options / Virtio Over Fabrics / Transmission Protocol / Commands Definition
> / Stream Transmission}
> ~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission
> Protocol / Commands Definition / Stream Transmission} ...
> 
> \subsubsection{TLS-TCP/IP}\label{sec:Virtio Transport Options / Virtio Over
> Fabrics / Transport Binding / TLS-TCP_IP}
> TLS-TCP/IP supports both IPv4 and IPv6 ...
> 
> \subsubsection{RDMA}\label{sec:Virtio Transport Options / Virtio Over
> Fabrics / Transport Binding / RDMA}
> RDMA MUST use \ref{sec:Virtio Transport Options / Virtio Over Fabrics /
> Transmission Protocol / Commands Definition / Keyed Transmission}
> ~\nameref{sec:Virtio Transport Options / Virtio Over Fabrics / Transmission
> Protocol / Commands Definition / Keyed Transmission} ...
> 
> [\subsubsection{TCP/VSOCK}\label{sec:Virtio Transport Options / Virtio Over
> Fabrics / ransport Binding / TCP_VSOCK} ...]

Sounds good. Thanks!

> 
> > > 
> > > > > 
> > > > > > > +        u8 oftype;
> > > > > > > +        u8 padding[5];
> > > > > > > +};
> > > > > > > +\end{lstlisting}
> > > > > > > +
> > > > > > > +The Connect commands MUST contains one Segment Descriptor and one structure
> > > > > > > +virtio_of_command_connect to specify Initiator VQN and Target VNQ,
> > > > > > > +virtio_of_command_connect has following structure:
> > > > > > 
> > > > > > I'm confsued. virtio_of_command_connect was defined above. The struct
> > > > > > defined below is virtio_of_connect. Does this paragraph need to be
> > > > > > updated (virtio_of_command_connect -> virtio_of_connect)?
> > > > > > 
> > > > > > Why is virtio_of_connect a separate struct and not part of
> > > > > > virtio_of_command_connect?
> > > > > > 
> > > > > 
> > > > > Because I'd like to define all the commands with a fixed length.
> > > > 
> > > > I don't understand. virtio_of_connect and virtio_of_command_connect are
> > > > both fixed-length. Why can't they be unified into 1 fixed-length struct?
> > > > 
> > > 
> > > For stream protocol, it always work fine.
> > > For keyed protocol, for example RDMA, the target side needs to use
> > > ibv_post_recv to receive a large size(sizeof virtio_of_command_connect +
> > > sizeof virtio_of_connect). If the target uses ibv_post_recv to receive
> > > sizeof(CMD) + sizeof(DESC) * 1, the initiator fails in RDMA SEND.
> > 
> > I read that "A RC connection is very similar to a TCP connection" in the
> > NVIDIA documentation
> > (https://docs.nvidia.com/networking/display/RDMAAwareProgrammingv17/Transport+Modes)
> > and expected SOCK_STREAM semantics for RDMA SEND.
> > 
> > Are you saying ibv_post_send() fails when the receiver's work request
> > sg_list size is smaller (fewer bytes) than the sender's?
> > 
> 
> Yes, it will fail.
> The receiver get a CQE with status 'IBV_WC_LOC_LEN_ERR', see
> https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/

Parav: Can you confirm that this is expected?

This makes it hard to inline payloads as I was suggesting before :(.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] RE: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-08 16:41               ` Stefan Hajnoczi
@ 2023-06-08 17:01                 ` Parav Pandit
  2023-06-09  1:39                   ` [virtio-comment] " zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Parav Pandit @ 2023-06-08 17:01 UTC (permalink / raw)
  To: Stefan Hajnoczi, zhenwei pi
  Cc: mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong


> From: Stefan Hajnoczi <stefanha@redhat.com>
> Sent: Thursday, June 8, 2023 12:41 PM

> > > > For stream protocol, it always work fine.
> > > > For keyed protocol, for example RDMA, the target side needs to use
> > > > ibv_post_recv to receive a large size(sizeof
> > > > virtio_of_command_connect + sizeof virtio_of_connect). If the
> > > > target uses ibv_post_recv to receive
> > > > sizeof(CMD) + sizeof(DESC) * 1, the initiator fails in RDMA SEND.
> > >
> > > I read that "A RC connection is very similar to a TCP connection" in
> > > the NVIDIA documentation
> > > (https://docs.nvidia.com/networking/display/RDMAAwareProgrammingv17/
> > > Transport+Modes) and expected SOCK_STREAM semantics for RDMA SEND.
> > >
> > > Are you saying ibv_post_send() fails when the receiver's work
> > > request sg_list size is smaller (fewer bytes) than the sender's?
> > >
> >
> > Yes, it will fail.
> > The receiver get a CQE with status 'IBV_WC_LOC_LEN_ERR', see
> > https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/
> 
> Parav: Can you confirm that this is expected?
> 
Ibv_post_send() will not fail because it is a queuing interface.
But the send operation itself will fail via send (requester) side completion moving the QP to error.
Receive q also moves to error.

> This makes it hard to inline payloads as I was suggesting before :(.

What I was suggesting in other thread, is if we want to inline the payload, we should do following.
RDMA write followed by RDMA send. So, a Block write commands actual data can be placed directly in say 4K memory of target.

This way, sender and receiver works with constant size buffers in send and receive queue.
RDMA is message based and not byte stream based.

Inline RDMA write is often called eager buffer, similar to PCIe write combine buffer.

Both doesn't likely work at scale as the buffer sharing becomes difficult across multiple connections.
It is memory vs perf trade off.
But doable.

We should start with first establishing the data transfer model covering 512B to 1M context and take up the optimizations as extensions.



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] Re: RE: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-08 17:01                 ` [virtio-comment] " Parav Pandit
@ 2023-06-09  1:39                   ` zhenwei pi
  2023-06-09  2:06                     ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-09  1:39 UTC (permalink / raw)
  To: Parav Pandit, Stefan Hajnoczi
  Cc: mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong



On 6/9/23 01:01, Parav Pandit wrote:
> 
>> From: Stefan Hajnoczi <stefanha@redhat.com>
>> Sent: Thursday, June 8, 2023 12:41 PM
> 
>>>>> For stream protocol, it always work fine.
>>>>> For keyed protocol, for example RDMA, the target side needs to use
>>>>> ibv_post_recv to receive a large size(sizeof
>>>>> virtio_of_command_connect + sizeof virtio_of_connect). If the
>>>>> target uses ibv_post_recv to receive
>>>>> sizeof(CMD) + sizeof(DESC) * 1, the initiator fails in RDMA SEND.
>>>>
>>>> I read that "A RC connection is very similar to a TCP connection" in
>>>> the NVIDIA documentation
>>>> (https://docs.nvidia.com/networking/display/RDMAAwareProgrammingv17/
>>>> Transport+Modes) and expected SOCK_STREAM semantics for RDMA SEND.
>>>>
>>>> Are you saying ibv_post_send() fails when the receiver's work
>>>> request sg_list size is smaller (fewer bytes) than the sender's?
>>>>
>>>
>>> Yes, it will fail.
>>> The receiver get a CQE with status 'IBV_WC_LOC_LEN_ERR', see
>>> https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/
>>
>> Parav: Can you confirm that this is expected?
>>
> Ibv_post_send() will not fail because it is a queuing interface.
> But the send operation itself will fail via send (requester) side completion moving the QP to error.
> Receive q also moves to error.
> 
>> This makes it hard to inline payloads as I was suggesting before :(.
> 
> What I was suggesting in other thread, is if we want to inline the payload, we should do following.
> RDMA write followed by RDMA send. So, a Block write commands actual data can be placed directly in say 4K memory of target.
> 
> This way, sender and receiver works with constant size buffers in send and receive queue.
> RDMA is message based and not byte stream based.
> 
> Inline RDMA write is often called eager buffer, similar to PCIe write combine buffer.
> 
> Both doesn't likely work at scale as the buffer sharing becomes difficult across multiple connections.
> It is memory vs perf trade off.
> But doable.
> 
> We should start with first establishing the data transfer model covering 512B to 1M context and take up the optimizations as extensions.
> 
> 

Hi, Parav

What do you think about another RDMA inline proposal in
'[PATCH v2 11/11] transport-fabrics: support inline data for keyed 
transmission'?

1, use feature command to get the target max recv buffer size, for 
example 16k
2, use feature command to set the initiator max recv buffer size, for 
example 16k
If the size of payload is less than max recv buffer size, using a single 
RDMA SEND is enough. for example, virtio-blk writes 8k: 16 + 8192 < 
16384, this means a single RDMA SEND is fine.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [virtio-comment] RE: RE: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-09  1:39                   ` [virtio-comment] " zhenwei pi
@ 2023-06-09  2:06                     ` Parav Pandit
  2023-06-09  3:55                       ` zhenwei pi
  0 siblings, 1 reply; 74+ messages in thread
From: Parav Pandit @ 2023-06-09  2:06 UTC (permalink / raw)
  To: zhenwei pi, Stefan Hajnoczi
  Cc: mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong


> From: zhenwei pi <pizhenwei@bytedance.com>
> Sent: Thursday, June 8, 2023 9:39 PM


> > We should start with first establishing the data transfer model covering 512B
> to 1M context and take up the optimizations as extensions.
> >
> >
> 
> Hi, Parav
> 
> What do you think about another RDMA inline proposal in '[PATCH v2 11/11]
> transport-fabrics: support inline data for keyed transmission'?
> 
> 1, use feature command to get the target max recv buffer size, for example 16k
> 2, use feature command to set the initiator max recv buffer size, for example
> 16k If the size of payload is less than max recv buffer size, using a single RDMA
> SEND is enough. for example, virtio-blk writes 8k: 16 + 8192 < 16384, this
> means a single RDMA SEND is fine.

Let me read it.
From above short description, it appears that every receive buffer posted must be of size 16K.
And if sender choose not to do inline, there is super buffer wasted.

If it is read only or read workload, target majority buffer wastage is close to 98% or so assuming 64B command size.

And when buffer is full, the sender is stalled for the full round trip to enqueue the command.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [virtio-comment] RE: RE: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-09  2:06                     ` [virtio-comment] " Parav Pandit
@ 2023-06-09  3:55                       ` zhenwei pi
  2023-06-11 20:56                         ` Parav Pandit
  0 siblings, 1 reply; 74+ messages in thread
From: zhenwei pi @ 2023-06-09  3:55 UTC (permalink / raw)
  To: Parav Pandit, Stefan Hajnoczi
  Cc: mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong

On 6/9/23 10:06, Parav Pandit wrote:
> 
>> From: zhenwei pi <pizhenwei@bytedance.com>
>> Sent: Thursday, June 8, 2023 9:39 PM
> 
> 
>>> We should start with first establishing the data transfer model covering 512B
>> to 1M context and take up the optimizations as extensions.
>>>
>>>
>>
>> Hi, Parav
>>
>> What do you think about another RDMA inline proposal in '[PATCH v2 11/11]
>> transport-fabrics: support inline data for keyed transmission'?
>>
>> 1, use feature command to get the target max recv buffer size, for example 16k
>> 2, use feature command to set the initiator max recv buffer size, for example
>> 16k If the size of payload is less than max recv buffer size, using a single RDMA
>> SEND is enough. for example, virtio-blk writes 8k: 16 + 8192 < 16384, this
>> means a single RDMA SEND is fine.
> 
> Let me read it.
>  From above short description, it appears that every receive buffer posted must be of size 16K.
> And if sender choose not to do inline, there is super buffer wasted.
> 
> If it is read only or read workload, target majority buffer wastage is close to 98% or so assuming 64B command size.
> 
> And when buffer is full, the sender is stalled for the full round trip to enqueue the command.

Yes, this waste memory, it's not good enough.

I tried to understand your proposal, please correct me if I misunderstand...

Define data structure like:

struct virtio_of_keyed_desc {
         le64 addr;
         le32 length;
         le32 key;
};

struct virtio_of_command_vq {
         le16 opcode;
         le16 command_id;
         le32 out_length;
         le32 in_length;
         union {
                 struct virtio_of_keyed {
                         le32 out_offset;
                 };

                 struct virtio_of_stream {
                         u8 rsvd[4];
                 };
         };
};

struct virtio_of_completion {
         le16 status;
         le16 command_id;
         u8 rsvd[4];
         union {
                 le64 value;
                 struct virtio_of_vq_completion {
                         le32 in_length;
                         le32 len;
                 };
         }
};


For stream(Ex TCP/IP), the request PDU includes [struct 
virtio_of_command_vq + data], the response PDU includes [struct 
virtio_of_completion + data].

For keyed(Ex RDMA), the request PDU includes [struct 
virtio_of_command_vq + struct virtio_of_keyed_desc], there are 2 opcodes 
for keyed transmission:
1, opcode virtio_of_op_vq: (basic and required command)
the initiator prepares a buffer of [out_length + in_length], the target 
recv a 32B command, and reads the remote memory [addr, addr+out_length) 
by RDMA READ, then writes the remote memory [addr+out_length, 
addr+out_length+in_length) by RDMA WRITE, finally sends completion by 
RDMA SEND.

2, opcode virtio_of_op_vq_write_inline: (optional command)
the initiator gets a remote buffer of target(Ex, 128K) after feature 
negotiation.

The initiator selects a region of target remote memory(Ex, 4k - 12k), 
and writes payload by RDMA WRITE, then sends a 32B command by RDMA 
SEND(out_offset is 4K, ).
The target handles command, writes the remote memory [addr, 
addr+in_length), finally sends completion by RDMA SEND.

-- 
zhenwei pi

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [virtio-comment] RE: RE: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set
  2023-06-09  3:55                       ` zhenwei pi
@ 2023-06-11 20:56                         ` Parav Pandit
  0 siblings, 0 replies; 74+ messages in thread
From: Parav Pandit @ 2023-06-11 20:56 UTC (permalink / raw)
  To: zhenwei pi, Stefan Hajnoczi
  Cc: mst, jasowang, virtio-comment, houp, helei.sig11, xinhao.kong


> From: zhenwei pi <pizhenwei@bytedance.com>
> Sent: Thursday, June 8, 2023 11:55 PM

> I tried to understand your proposal, please correct me if I misunderstand...
> 
> Define data structure like:
> 
> struct virtio_of_keyed_desc {
>          le64 addr;
>          le32 length;
>          le32 key;
> };
> 
> struct virtio_of_command_vq {
>          le16 opcode;
>          le16 command_id;
>          le32 out_length;
>          le32 in_length;
>          union {
>                  struct virtio_of_keyed {
>                          le32 out_offset;
>                  };
> 
>                  struct virtio_of_stream {
>                          u8 rsvd[4];
>                  };
>          };
> };
> 
> struct virtio_of_completion {
>          le16 status;
>          le16 command_id;
>          u8 rsvd[4];
>          union {
>                  le64 value;
>                  struct virtio_of_vq_completion {
>                          le32 in_length;
>                          le32 len;
>                  };
>          }
> };
> 
> 
> For stream(Ex TCP/IP), the request PDU includes [struct virtio_of_command_vq
> + data], the response PDU includes [struct virtio_of_completion + data].
> 
> For keyed(Ex RDMA), the request PDU includes [struct virtio_of_command_vq +
> struct virtio_of_keyed_desc], there are 2 opcodes for keyed transmission:
> 1, opcode virtio_of_op_vq: (basic and required command) the initiator prepares
> a buffer of [out_length + in_length], the target recv a 32B command, and reads
> the remote memory [addr, addr+out_length) by RDMA READ, then writes the
> remote memory [addr+out_length,
> addr+out_length+in_length) by RDMA WRITE, finally sends completion by
> RDMA SEND.
> 
Maybe we can switch to the 64B format which has two benefits.
1. separate RDMA buffer for in, out xfer as each can have different DMA attributes.
2. ability to have one or more inline descs

A good way is to negotiate the max_cmd_size minimum being 32, maximum being a finite reasonable number of 64 or 128.

> 2, opcode virtio_of_op_vq_write_inline: (optional command)
> the initiator gets a remote buffer of target(Ex, 128K) after feature
> negotiation.
> 
> The initiator selects a region of target remote memory(Ex, 4k - 12k),
> and writes payload by RDMA WRITE, then sends a 32B command by RDMA
> SEND(out_offset is 4K, ).
> The target handles command, writes the remote memory [addr,
> addr+in_length), finally sends completion by RDMA SEND.
Yes.

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2023-06-11 20:56 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-04  8:18 [virtio-comment] [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 01/11] transport-fabrics: introduce Virtio Over Fabrics overview zhenwei pi
2023-05-04  8:57   ` David Hildenbrand
2023-05-04  9:46     ` zhenwei pi
2023-05-04 10:05       ` Michael S. Tsirkin
2023-05-04 10:12         ` David Hildenbrand
2023-05-04 10:50         ` Re: " zhenwei pi
2023-05-31 14:00   ` [virtio-comment] " Stefan Hajnoczi
2023-06-02  1:17     ` [virtio-comment] " zhenwei pi
2023-06-05  2:39   ` [virtio-comment] " Parav Pandit
2023-06-05  2:39   ` Parav Pandit
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 02/11] transport-fabrics: introduce Virtio Qualified Name zhenwei pi
2023-05-31 14:06   ` Stefan Hajnoczi
2023-06-02  1:50     ` zhenwei pi
2023-06-05  2:40       ` Parav Pandit
2023-06-05  7:57         ` zhenwei pi
2023-06-05 17:05         ` Stefan Hajnoczi
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 03/11] transport-fabircs: introduce Segment Descriptor Definition zhenwei pi
2023-05-31 14:23   ` Stefan Hajnoczi
2023-06-02  3:08     ` zhenwei pi
2023-06-05  2:40   ` [virtio-comment] " Parav Pandit
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 04/11] transport-fabrics: introduce Stream Transmission zhenwei pi
2023-05-31 15:20   ` Stefan Hajnoczi
2023-06-02  2:26     ` zhenwei pi
2023-06-05 16:11       ` Stefan Hajnoczi
2023-06-06  3:13         ` zhenwei pi
2023-06-06 13:09           ` Stefan Hajnoczi
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission zhenwei pi
2023-05-31 16:20   ` [virtio-comment] " Stefan Hajnoczi
2023-06-01  9:02     ` zhenwei pi
2023-06-01 11:33       ` Stefan Hajnoczi
2023-06-01 13:09         ` zhenwei pi
2023-06-01 19:13           ` Stefan Hajnoczi
2023-06-01 21:23             ` Stefan Hajnoczi
2023-06-02  0:55               ` zhenwei pi
2023-06-05 17:21                 ` Stefan Hajnoczi
2023-06-05  2:41   ` Parav Pandit
2023-06-05  8:41     ` zhenwei pi
2023-06-05 11:45       ` Parav Pandit
2023-06-05 12:50         ` zhenwei pi
2023-06-05 13:12           ` Parav Pandit
2023-06-06  7:13             ` zhenwei pi
2023-06-06 21:52               ` Parav Pandit
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 06/11] transport-fabrics: introduce command set zhenwei pi
2023-05-31 17:10   ` [virtio-comment] " Stefan Hajnoczi
2023-06-02  5:15     ` [virtio-comment] " zhenwei pi
2023-06-05 16:30       ` Stefan Hajnoczi
2023-06-06  1:31         ` [virtio-comment] " zhenwei pi
2023-06-06 13:34           ` Stefan Hajnoczi
2023-06-07  2:58             ` [virtio-comment] " zhenwei pi
2023-06-08 16:41               ` Stefan Hajnoczi
2023-06-08 17:01                 ` [virtio-comment] " Parav Pandit
2023-06-09  1:39                   ` [virtio-comment] " zhenwei pi
2023-06-09  2:06                     ` [virtio-comment] " Parav Pandit
2023-06-09  3:55                       ` zhenwei pi
2023-06-11 20:56                         ` Parav Pandit
2023-06-06  2:02         ` [virtio-comment] " zhenwei pi
2023-06-06 13:44           ` Stefan Hajnoczi
2023-06-07  2:03             ` [virtio-comment] " zhenwei pi
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 07/11] transport-fabrics: introduce opcodes zhenwei pi
2023-05-31 17:11   ` [virtio-comment] " Stefan Hajnoczi
     [not found]   ` <20230531205508.GA1509630@fedora>
2023-06-02  8:39     ` [virtio-comment] " zhenwei pi
2023-06-05 16:46       ` Stefan Hajnoczi
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 08/11] transport-fabrics: introduce status of completion zhenwei pi
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 09/11] transport-fabrics: add TCP&RDMA binding zhenwei pi
     [not found]   ` <20230531210255.GC1509630@fedora>
2023-06-02  9:07     ` [virtio-comment] Re: " zhenwei pi
2023-06-05 16:57       ` Stefan Hajnoczi
2023-06-06  1:41         ` [virtio-comment] " zhenwei pi
2023-06-06 13:51           ` Stefan Hajnoczi
2023-06-07  2:15             ` zhenwei pi
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 10/11] transport-fabrics: add device initialization zhenwei pi
     [not found]   ` <20230531210925.GD1509630@fedora>
2023-06-02  9:11     ` zhenwei pi
2023-05-04  8:19 ` [virtio-comment] [PATCH v2 11/11] transport-fabrics: support inline data for keyed transmission zhenwei pi
2023-05-29  0:56 ` [virtio-comment] PING: [PATCH v2 00/11] Introduce Virtio Over Fabrics zhenwei pi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.