* [RFC PATCH] virtio-vsock: add description for datagram type
@ 2021-03-16 21:56 jiang.wang
  2021-03-17 15:44 ` Stefan Hajnoczi
  0 siblings, 1 reply; 19+ messages in thread
From: jiang.wang @ 2021-03-16 21:56 UTC (permalink / raw)
  To: mst, cohuck
  Cc: cong.wang, duanxiongchun, jiang.wang, virtualization, xieyongji,
	stefanha, asias, arseny.krasnov

Add support for the datagram type to virtio-vsock. Datagram
sockets are connectionless and unreliable. To avoid contention
with stream and other sockets, add two more virtqueues and
a new feature bit to indicate whether the two new queues exist.

Also add a description of resource management for datagram
sockets, which do not use the existing credit update mechanism
associated with stream sockets.

Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
---
 virtio-vsock.tex | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 62 insertions(+), 10 deletions(-)

diff --git a/virtio-vsock.tex b/virtio-vsock.tex
index da7e641..a2ff0a4 100644
--- a/virtio-vsock.tex
+++ b/virtio-vsock.tex
@@ -9,14 +9,27 @@ \subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
 
 \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
 \begin{description}
+\item[0] stream rx
+\item[1] stream tx
+\item[2] dgram rx
+\item[3] dgram tx
+\item[4] event
+\end{description}
+The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_DGRAM is set. Otherwise, it
+only uses 3 queues, as follows:
+
+\begin{description}
 \item[0] rx
 \item[1] tx
 \item[2] event
 \end{description}
 
+
 \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
 
-There are currently no feature bits defined for this device.
+\begin{description}
+\item[VIRTIO_VSOCK_DGRAM (0)] Device has support for the datagram socket type.
+\end{description}
 
 \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
 
@@ -76,6 +89,7 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 	le32 flags;
 	le32 buf_alloc;
 	le32 fwd_cnt;
+	le64 timestamp;
 };
 \end{lstlisting}
 
@@ -121,11 +135,14 @@ \subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device /
 packets.  With additional resources, it becomes possible to process incoming
 packets even when outgoing packets cannot be sent.
 
-Eventually even the additional resources will be exhausted and further
+For stream type, eventually even the additional resources will be exhausted and further
 processing is not possible until the other side processes the virtqueue that
 it has neglected.  This stop to processing prevents one side from causing
 unbounded resource consumption in the other side.
 
+For datagram type, the additional resources have a fixed size limit to prevent
+unbounded resource consumption.
+
 \drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
 
 The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
@@ -140,12 +157,15 @@ \subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Opera
 consists of a (cid, port number) tuple. The header fields used for this are
 \field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
 
-Currently only stream sockets are supported. \field{type} is 1 for stream
-socket types.
+Currently stream and datagram (dgram) sockets are supported. \field{type} is 1 for stream
+socket types. \field{type} is 3 for dgram socket types.
 
 Stream sockets provide in-order, guaranteed, connection-oriented delivery
 without message boundaries.
 
+Datagram sockets provide connectionless, unreliable messages of
+a fixed maximum length.
+
 \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management}
 \field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
 stream sockets. The guest and the device publish how much buffer space is
@@ -170,20 +190,41 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
 previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
 communicating updates any time a change in buffer space occurs.
 
+For datagram sockets, \field{buf_alloc} is also used on the rx side. Unlike stream
+sockets, dgram sockets do not use VIRTIO_VSOCK_OP_CREDIT_UPDATE or
+VIRTIO_VSOCK_OP_CREDIT_REQUEST packets. The dgram sender does not check whether
+the peer's \field{buf_alloc} is full; if it is, the packet
+will be dropped. To limit resource usage of the sender, \field{dgram_tx_bytes}
+is maintained for each VM and host. Only payload bytes are counted; header bytes
+are not included. If \field{dgram_tx_bytes} has reached VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE,
+the send fails.
+
+For dgram, since \field{dgram_tx_bytes} is shared within a VM or host, a tx_timeout and
+a timer are used for each outgoing packet. If a packet is not delivered within
+tx_timeout, it is dropped to free space for other dgram sockets.
+
 \drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
-VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
-sufficient free buffer space for the payload.
+For stream sockets, VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
+sufficient free buffer space for the payload. For dgram sockets, VIRTIO_VSOCK_OP_RW data packets
+can be transmitted when the peer buffer is full. Then the packet will be dropped.
 
 All packets associated with a stream flow MUST contain valid information in
 \field{buf_alloc} and \field{fwd_cnt} fields.
 
+All packets associated with a dgram flow MUST contain valid information in
+\field{timestamp} field, which will be used to check for tx timeout.
+
 \devicenormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
-VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
-sufficient free buffer space for the payload.
+For stream sockets, VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
+sufficient free buffer space for the payload. For dgram sockets, VIRTIO_VSOCK_OP_RW data packets
+can be transmitted when the peer buffer is full. Then the packet will be dropped.
 
 All packets associated with a stream flow MUST contain valid information in
 \field{buf_alloc} and \field{fwd_cnt} fields.
 
+All packets associated with a dgram flow MUST contain valid information in
+\field{timestamp} field.
+
 \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / Device Operation / Receive and Transmit}
 The driver queues outgoing packets on the tx virtqueue and incoming packet
 receive buffers on the rx virtqueue. Packets are of the following form:
@@ -203,14 +244,14 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 The \field{guest_cid} configuration field MUST be used as the source CID when
 sending outgoing packets.
 
-A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
+For stream sockets, a VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
 unknown \field{type} value.
 
 \devicenormative{\paragraph}{Device Operation: Receive and Transmit}{Device Types / Socket Device / Device Operation / Receive and Transmit}
 
 The \field{guest_cid} configuration field MUST NOT contain a reserved CID as listed in \ref{sec:Device Types / Socket Device / Device configuration layout}.
 
-A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
+For stream sockets, a VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
 unknown \field{type} value.
 
 \subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device / Device Operation / Stream Sockets}
@@ -240,6 +281,17 @@ \subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device / Device O
 destination) address tuple for a new connection while the other peer is still
 processing the old connection.
 
+\subsubsection{Datagram Sockets}\label{sec:Device Types / Socket Device / Device Operation / Datagram Sockets}
+
+Datagram (dgram) sockets are connectionless and unreliable. The sender just sends
+a message to the peer and hopes it will be delivered. If no socket is bound to the
+address on the other end, or the transmission or receive buffers are full, the packets
+will be dropped. Each packet carries a source cid and
+source port, which the receiver can use to send back a reply message.
+
+Dgram sockets preserve the message boundaries. A message is either sent or dropped.
+Multiple messages will not be combined.
+
 \subsubsection{Device Events}\label{sec:Device Types / Socket Device / Device Operation / Device Events}
 
 Certain events are communicated by the device to the driver using the event
-- 
2.11.0


* Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-16 21:56 [RFC PATCH] virtio-vsock: add description for datagram type jiang.wang
@ 2021-03-17 15:44 ` Stefan Hajnoczi
  2021-03-18 17:59   ` [External] " Jiang Wang .
       [not found]   ` <CAA68J_bQHzFXnsLpCqZ3waPW1NGz+hnu2OXfAG4XOLemLOX9DQ@mail.gmail.com>
  0 siblings, 2 replies; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-03-17 15:44 UTC (permalink / raw)
  To: jiang.wang
  Cc: cong.wang, duanxiongchun, mst, cohuck, virtualization, xieyongji,
	arseny.krasnov, asias



On Tue, Mar 16, 2021 at 09:56:44PM +0000, jiang.wang wrote:
> Add supports for datagram type for virtio-vsock. Datagram
> sockets are connectionless and unreliable. To avoid contention
> with stream and other sockets, add two more virtqueues and
> a new feature bit to identify if those two new queues exist or not.
> 
> Also add descriptions for resouce management of datagram, which
> does not use the existing credit update mechanism associated with
> stream sockets.
> 
> Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> ---
>  virtio-vsock.tex | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 62 insertions(+), 10 deletions(-)
> 
> diff --git a/virtio-vsock.tex b/virtio-vsock.tex
> index da7e641..a2ff0a4 100644
> --- a/virtio-vsock.tex
> +++ b/virtio-vsock.tex
> @@ -9,14 +9,27 @@ \subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
>  
>  \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
>  \begin{description}
> +\item[0] stream rx
> +\item[1] stream tx
> +\item[2] dgram rx
> +\item[3] dgram tx
> +\item[4] event
> +\end{description}
> +The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_DRGAM is set. Otherwise, it
> +only uses 3 queues, as the following.
> +
> +\begin{description}
>  \item[0] rx
>  \item[1] tx
>  \item[2] event
>  \end{description}
>  
> +
>  \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
>  
> -There are currently no feature bits defined for this device.
> +\begin{description}
> +\item[VIRTIO_VSOCK_DGRAM (0)] Device has support for datagram sockets type.
> +\end{description}

By convention feature bits are named VIRTIO_<device>_F_<feature>. Please
add the "_F_".

>  
>  \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
>  
> @@ -76,6 +89,7 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
>  	le32 flags;
>  	le32 buf_alloc;
>  	le32 fwd_cnt;
> +	le64 timestamp;

Adding this field breaks old devices and drivers. Please make this field
conditional on the dgram socket type or the VIRTIO_VSOCK_F_DGRAM feature
bit.

Also, are all the other fields still used with dgram sockets? Maybe you
can use a union instead to reuse some space?

>  };
>  \end{lstlisting}
>  
> @@ -121,11 +135,14 @@ \subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device /
>  packets.  With additional resources, it becomes possible to process incoming
>  packets even when outgoing packets cannot be sent.
>  
> -Eventually even the additional resources will be exhausted and further
> +For stream type, eventually even the additional resources will be exhausted and further
>  processing is not possible until the other side processes the virtqueue that
>  it has neglected.  This stop to processing prevents one side from causing
>  unbounded resource consumption in the other side.
>  
> +For datagram type, the additional resources have a fixed size limit to prevent
> +unbounded resource consumption.
> +
>  \drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
>  
>  The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
> @@ -140,12 +157,15 @@ \subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Opera
>  consists of a (cid, port number) tuple. The header fields used for this are
>  \field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
>  
> -Currently only stream sockets are supported. \field{type} is 1 for stream
> -socket types.
> +Currently stream and datagram (dgram) sockets are supported. \field{type} is 1 for stream
> +socket types. \field{type} is 3 for dgram socket types.
>  
>  Stream sockets provide in-order, guaranteed, connection-oriented delivery
>  without message boundaries.
>  
> +Datagram socekts provide connectionless unreliable messages of

s/socekts/sockets/

> +a fixed maximum length.
> +
>  \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management}
>  \field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
>  stream sockets. The guest and the device publish how much buffer space is
> @@ -170,20 +190,41 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
>  previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
>  communicating updates any time a change in buffer space occurs.
>  
> +For datagram sockets, \field{buf_alloc} is also used on the rx side. Unlike stream 
> +sockets, dgram sockets do not use VIRTIO_VSOCK_OP_CREDIT_UPDATE or 
> +VIRTIO_VSOCK_OP_CREDIT_REQUEST packets. The dgram sender does not check if
> +the \field{buf_alloc} of the peer is full or not. If it is full, the packet
> +will be dropped. To limit resouce usage of the sender, \field{dgram_tx_bytes}

s/resouce/resource/

\field{dgram_tx_bytes} is not included in any struct definition?

> +is used for each VM and host. Only payload bytes are counted and header bytes are not

Please use the VIRTIO specification terms "driver" and "device" instead
of VM and host.

It's not clear to me what "used" means. What exactly do the driver and
device need to do?

> +included. If \field{dgram_tx_bytes} is equal to VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE,
> +the send will fail.

Does the device reject dgram packets or does the driver refuse to send
more dgram packets?

> +
> +For dgram, since \field{dgram_tx_byte} is shared within a VM or host, a tx_timeout and
> +a timer are used for each outgoing packet. If a packet is not delivered within
> +tx_timeout, it will be dropped to make free space for other dgram sockets.

What does this mean? Does the timeout cause the device to drop packets
that haven't been handled yet (e.g. read by a host userspace process)?

POSIX Sockets have a receive socket buffer (rcvbuf) that is used for
memory accounting and dropping packets. Operating systems implementing
POSIX Sockets would use that to decide when incoming packets are
dropped. It seems like dgram_tx_byte does something similar at the
device level and I'm not sure why it's necessary?

In the POSIX Sockets model the virtio-vsock tx virtqueue is processed by
the device and packets are read into socket rcvbuf. They do not stay in
the virtqueue. This involves an extra data copy but it keeps the
virtqueue as empty as possible so that further communication is
possible.
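
As a rough sketch (the helper names here are made up, not from the
Linux implementation), that model looks like:

/* Drain the tx virtqueue into per-socket receive buffers so the
 * virtqueue stays as empty as possible. */
void process_tx_virtqueue(struct virtqueue *tx_vq)
{
    struct pkt *pkt;

    while ((pkt = virtqueue_pop(tx_vq)) != NULL) {
        struct sock *sk = lookup_socket(pkt->dst_cid, pkt->dst_port);

        if (!sk || rcvbuf_full(sk))
            drop_pkt(pkt);           /* unreliable: drop, don't stall */
        else
            copy_to_rcvbuf(sk, pkt); /* extra copy, but frees the ring */
        return_buffer_to_ring(tx_vq, pkt);
    }
}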

> +
>  \drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
> -VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> -sufficient free buffer space for the payload.
> +For stream sockets, VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> +sufficient free buffer space for the payload. For dgram sockets, VIRTIO_VSOCK_OP_RW data packets
> +can be transmitted when the peer buffer is full. Then the pacekt will be dropped.

s/can/MAY/

s/pacekt/packet/

"Then the packet will be dropped" is not clear to me. Is it saying the
device drops packets when buffer space has exceeded?

>  
>  All packets associated with a stream flow MUST contain valid information in
>  \field{buf_alloc} and \field{fwd_cnt} fields.
>  
> +All packets associated with a dgram flow MUST contain valid information in
> +\field{timestamp} field, which will be used to check for tx timeout.

What are the units of the timestamp field?

> +
>  \devicenormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
> -VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> -sufficient free buffer space for the payload.
> +For stream sockets, VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> +sufficient free buffer space for the payload. For dgram sockets, VIRTIO_VSOCK_OP_RW data packets
> +can be transmitted when the peer buffer is full. Then the pacekt will be dropped.
>  
>  All packets associated with a stream flow MUST contain valid information in
>  \field{buf_alloc} and \field{fwd_cnt} fields.
>  
> +All packets associated with a dgram flow MUST contain valid information in
> +\field{timestamp} field.
> +
>  \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / Device Operation / Receive and Transmit}
>  The driver queues outgoing packets on the tx virtqueue and incoming packet
>  receive buffers on the rx virtqueue. Packets are of the following form:
> @@ -203,14 +244,14 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
>  The \field{guest_cid} configuration field MUST be used as the source CID when
>  sending outgoing packets.
>  
> -A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> +For stream socekts, A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
>  unknown \field{type} value.
>  
>  \devicenormative{\paragraph}{Device Operation: Receive and Transmit}{Device Types / Socket Device / Device Operation / Receive and Transmit}
>  
>  The \field{guest_cid} configuration field MUST NOT contain a reserved CID as listed in \ref{sec:Device Types / Socket Device / Device configuration layout}.
>  
> -A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> +For stream sockets, A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
>  unknown \field{type} value.
>  
>  \subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device / Device Operation / Stream Sockets}
> @@ -240,6 +281,17 @@ \subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device / Device O
>  destination) address tuple for a new connection while the other peer is still
>  processing the old connection.
>  
> +\subsubsection{Datagram Sockets}\label{sec:Device Types / Socket Device / Device Operation / Stream Sockets}
> +
> +Datagram (dgram) sockets are connectionless and unreliable. The sender just send 
> +a message to the peer and hope it will be delivered. If there is no socket binds the 
> +address on the other end, or the transmision or receving buffers are full, the packets 
> +will be droped. Each packet may have a source cid and 

s/droped/dropped/

Please stick to UDP semantics as much as possible so that applications
can be ported easily and developers aren't surprised by unexpected
behavior. UDP packets sent to a destination that has no listen socket
result in a Connection Refused error. vsock dgrams should behave in the
same way.
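
In code form the UDP-like behavior could look like this (a sketch with
hypothetical helpers; the RST-on-no-listener mapping is an assumption):

/* No bound socket at the destination: reply with VIRTIO_VSOCK_OP_RST
 * so the sender can surface a Connection Refused error. */
void dgram_deliver(struct pkt *pkt)
{
    struct sock *sk = lookup_bound_socket(pkt->dst_cid, pkt->dst_port);

    if (!sk) {
        send_rst_reply(pkt);    /* analogous to ICMP port unreachable */
        drop_pkt(pkt);
        return;
    }
    queue_to_socket(sk, pkt);
}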

> +source port, the receiver can use them to send back a reply message.

The VIRTIO specification avoids using the word "may" (as well as "must"
and "should") outside the normative sections of the specification.

It's unclear what this sentence means: can the driver set the source cid
and source port to zero or an arbitrary number if it does not need a
reply? (I guess the answer is "no" but the sentence implies setting the
source cid and source port is optional.)

> +
> +Dgram sockets preserve the message boundaries. A message is either sent or dropped.

What does "a message is either sent or dropped" mean? Does it mean
messages are delivered once or not at all, but they are never
duplicated?


* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-17 15:44 ` Stefan Hajnoczi
@ 2021-03-18 17:59   ` Jiang Wang .
  2021-03-22 16:50     ` Stefan Hajnoczi
       [not found]   ` <CAA68J_bQHzFXnsLpCqZ3waPW1NGz+hnu2OXfAG4XOLemLOX9DQ@mail.gmail.com>
  1 sibling, 1 reply; 19+ messages in thread
From: Jiang Wang . @ 2021-03-18 17:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias

On Wed, Mar 17, 2021 at 8:45 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Mar 16, 2021 at 09:56:44PM +0000, jiang.wang wrote:
> > Add supports for datagram type for virtio-vsock. Datagram
> > sockets are connectionless and unreliable. To avoid contention
> > with stream and other sockets, add two more virtqueues and
> > a new feature bit to identify if those two new queues exist or not.
> >
> > Also add descriptions for resouce management of datagram, which
> > does not use the existing credit update mechanism associated with
> > stream sockets.
> >
> > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > ---
> >  virtio-vsock.tex | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 62 insertions(+), 10 deletions(-)
> >
> > diff --git a/virtio-vsock.tex b/virtio-vsock.tex
> > index da7e641..a2ff0a4 100644
> > --- a/virtio-vsock.tex
> > +++ b/virtio-vsock.tex
> > @@ -9,14 +9,27 @@ \subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
> >
> >  \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
> >  \begin{description}
> > +\item[0] stream rx
> > +\item[1] stream tx
> > +\item[2] dgram rx
> > +\item[3] dgram tx
> > +\item[4] event
> > +\end{description}
> > +The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_DRGAM is set. Otherwise, it
> > +only uses 3 queues, as the following.
> > +
> > +\begin{description}
> >  \item[0] rx
> >  \item[1] tx
> >  \item[2] event
> >  \end{description}
> >
> > +
> >  \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
> >
> > -There are currently no feature bits defined for this device.
> > +\begin{description}
> > +\item[VIRTIO_VSOCK_DGRAM (0)] Device has support for datagram sockets type.
> > +\end{description}
>
> By convention feature bits are named VIRTIO_<device>_F_<feature>. Please
> add the "_F_".
>
Sure.

> >
> >  \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
> >
> > @@ -76,6 +89,7 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
> >       le32 flags;
> >       le32 buf_alloc;
> >       le32 fwd_cnt;
> > +     le64 timestamp;
>
> Adding this field breaks old devices and drivers. Please make this field
> conditional on the dgram socket type or the VIRTIO_VSOCK_F_DGRAM feature
> bit.
>
> Also, are all the other fields still used with dgram sockets? Maybe you
> can use a union instead to reuse some space?

I will make this new field conditional on the dgram socket type.

For the union idea, could I change the order of those fields? Dgram does not use
the flags and fwd_cnt fields, and I want to combine them in a union
with a le64 timestamp.
Something like:

struct virtio_vsock_hdr {
  le64 src_cid;
  le64 dst_cid;
  le32 src_port;
  le32 dst_port;
  le32 len;
  le16 type;
  le16 op;
  le32 buf_alloc;
  union {
    struct {
      le32 flags;
      le32 fwd_cnt;
    } stream;
    le64 dgram_timestamp;
  } internal; // or a better name
};

> >  };
> >  \end{lstlisting}
> >
> > @@ -121,11 +135,14 @@ \subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device /
> >  packets.  With additional resources, it becomes possible to process incoming
> >  packets even when outgoing packets cannot be sent.
> >
> > -Eventually even the additional resources will be exhausted and further
> > +For stream type, eventually even the additional resources will be exhausted and further
> >  processing is not possible until the other side processes the virtqueue that
> >  it has neglected.  This stop to processing prevents one side from causing
> >  unbounded resource consumption in the other side.
> >
> > +For datagram type, the additional resources have a fixed size limit to prevent
> > +unbounded resource consumption.
> > +
> >  \drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> >
> >  The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
> > @@ -140,12 +157,15 @@ \subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Opera
> >  consists of a (cid, port number) tuple. The header fields used for this are
> >  \field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
> >
> > -Currently only stream sockets are supported. \field{type} is 1 for stream
> > -socket types.
> > +Currently stream and datagram (dgram) sockets are supported. \field{type} is 1 for stream
> > +socket types. \field{type} is 3 for dgram socket types.
> >
> >  Stream sockets provide in-order, guaranteed, connection-oriented delivery
> >  without message boundaries.
> >
> > +Datagram socekts provide connectionless unreliable messages of
>
> s/socekts/sockets/
>
> > +a fixed maximum length.
> > +
> >  \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management}
> >  \field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
> >  stream sockets. The guest and the device publish how much buffer space is
> > @@ -170,20 +190,41 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
> >  previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
> >  communicating updates any time a change in buffer space occurs.
> >
> > +For datagram sockets, \field{buf_alloc} is also used on the rx side. Unlike stream
> > +sockets, dgram sockets do not use VIRTIO_VSOCK_OP_CREDIT_UPDATE or
> > +VIRTIO_VSOCK_OP_CREDIT_REQUEST packets. The dgram sender does not check if
> > +the \field{buf_alloc} of the peer is full or not. If it is full, the packet
> > +will be dropped. To limit resouce usage of the sender, \field{dgram_tx_bytes}
>
> s/resouce/resource/
>
> \field{dgram_tx_bytes} is not included in any struct definition?

dgram_tx_bytes is a field in struct vhost_vsock and virtio_vsock, but I
don't see those structs mentioned in the spec.

> > +is used for each VM and host. Only payload bytes are counted and header bytes are not
>
> Please use the VIRTIO specification terms "driver" and "device" instead
> of VM and host.
Sure.

> It's not clear to me what "used" means. What exactly do the driver and
> device need to do?

In the Linux and KVM case, I added dgram_tx_bytes to vhost_vsock
and virtio_vsock. Then in virtio_transport_send_pkt() and
vhost_transport_send_pkt(), the code first increments and checks
dgram_tx_bytes. If dgram_tx_bytes is not less than
VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE, the function frees the pkt with
virtio_transport_free_pkt() and returns. dgram_tx_bytes is decremented
in vhost_transport_do_send_pkt(), just before each
virtio_transport_free_pkt() call. It is similar for the device case.

The goal is to limit the memory usage of vsock. Since dgram does not
use credit, the sender can send far more packets than the receiver can
handle. Without the above check, the sender can use lots of memory and
cause an OOM in the Linux kernel.

I am not sure whether these details are too much for a spec, but I
think it would be good to add some description. Or we can just say that
the sender (device or driver) must not use unlimited resources.
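
Roughly, the accounting is like this (a sketch with simplified names,
not the exact kernel code):

/* Shared per-device (or per-driver) count of in-flight dgram payload
 * bytes; header bytes are not counted. */
static atomic_t dgram_tx_bytes;

int dgram_send_pkt(struct virtio_vsock_pkt *pkt)
{
    if (atomic_add_return(pkt->len, &dgram_tx_bytes) >
        VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE) {
        atomic_sub(pkt->len, &dgram_tx_bytes);
        virtio_transport_free_pkt(pkt);   /* sender drops the packet */
        return -ENOBUFS;
    }
    /* dgram_tx_bytes is decremented in the tx worker, just before
     * virtio_transport_free_pkt() is called on delivery. */
    queue_for_tx(pkt);
    return 0;
}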


> > +included. If \field{dgram_tx_bytes} is equal to VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE,
> > +the send will fail.
>
> Does the device reject dgram packets or does the driver refuse to send
> more dgram packets?

The sender will drop the packets; this is independent of the rx side.
If the device is the sender, the device will drop the packets. If the
driver is the sender, the driver will drop the packets.

> > +
> > +For dgram, since \field{dgram_tx_byte} is shared within a VM or host, a tx_timeout and
> > +a timer are used for each outgoing packet. If a packet is not delivered within
> > +tx_timeout, it will be dropped to make free space for other dgram sockets.
>
> What does this mean? Does the timeout cause the device to drop packets
> that haven't been handled yet (e.g. read by a host userspace process)?

This is on the sender side (whether driver or device) and is similar to
the dgram_tx_bytes mentioned above. dgram_tx_bytes only ensures that
once memory usage reaches the limit, the sender will not consume more
memory. But because dgram_tx_bytes is shared among the dgram sockets of
a device or driver, a misbehaving or malicious dgram socket could
potentially use all the memory and block other dgram sockets from
sending any new packets. To handle this case, the tx_timeout is used to
free some space in the sender's memory pool, so that other dgram
sockets can continue to send packets.
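
A sketch of the reclaim (names are illustrative; TX_TIMEOUT and the
queue layout are assumptions):

/* Drop queued packets older than tx_timeout so one socket cannot pin
 * the shared dgram_tx_bytes budget forever. */
void dgram_reclaim_expired(struct list_head *tx_queue, u64 now)
{
    struct virtio_vsock_pkt *pkt, *tmp;

    list_for_each_entry_safe(pkt, tmp, tx_queue, list) {
        if (now - pkt->hdr.timestamp < TX_TIMEOUT)
            continue;
        list_del(&pkt->list);
        atomic_sub(pkt->len, &dgram_tx_bytes);
        virtio_transport_free_pkt(pkt);   /* frees budget for others */
    }
}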

> POSIX Sockets have a receive socket buffer (rcvbuf) that is used for
> memory accounting and dropping packets. Operating systems implementing
> POSIX Sockets would use that to decide when incoming packets are
> dropped. It seems like dgram_tx_byte does something similar at the
> device level and I'm not sure why it's necessary?

dgram_tx_bytes is for the tx (sender) side. The receive buffer you mentioned
is on the receiving side.

> In the POSIX Sockets model the virtio-vsock tx virtqueue is processed by
> the device and packets are read into socket rcvbuf. They do not stay in
> the virtqueue. This involves an extra data copy but it keeps the
> virtqueue as empty as possible so that further communication is
> possible.
> > +
> >  \drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
> > -VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> > -sufficient free buffer space for the payload.
> > +For stream sockets, VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> > +sufficient free buffer space for the payload. For dgram sockets, VIRTIO_VSOCK_OP_RW data packets
> > +can be transmitted when the peer buffer is full. Then the pacekt will be dropped.
>
> s/can/MAY/
>
> s/pacekt/packet/
>
> "Then the packet will be dropped" is not clear to me. Is it saying the
> device drops packets when buffer space has exceeded?

Yes, see above replies.

> >
> >  All packets associated with a stream flow MUST contain valid information in
> >  \field{buf_alloc} and \field{fwd_cnt} fields.
> >
> > +All packets associated with a dgram flow MUST contain valid information in
> > +\field{timestamp} field, which will be used to check for tx timeout.
>
> What are the units of the timestamp field?

I am thinking about using jiffies in the Linux kernel. But for a more
generic virtio spec, maybe we should use something more portable for
people not familiar with the Linux kernel?

> > +
> >  \devicenormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
> > -VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> > -sufficient free buffer space for the payload.
> > +For stream sockets, VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> > +sufficient free buffer space for the payload. For dgram sockets, VIRTIO_VSOCK_OP_RW data packets
> > +can be transmitted when the peer buffer is full. Then the pacekt will be dropped.
> >
> >  All packets associated with a stream flow MUST contain valid information in
> >  \field{buf_alloc} and \field{fwd_cnt} fields.
> >
> > +All packets associated with a dgram flow MUST contain valid information in
> > +\field{timestamp} field.
> > +
> >  \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / Device Operation / Receive and Transmit}
> >  The driver queues outgoing packets on the tx virtqueue and incoming packet
> >  receive buffers on the rx virtqueue. Packets are of the following form:
> > @@ -203,14 +244,14 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
> >  The \field{guest_cid} configuration field MUST be used as the source CID when
> >  sending outgoing packets.
> >
> > -A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> > +For stream socekts, A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> >  unknown \field{type} value.
> >
> >  \devicenormative{\paragraph}{Device Operation: Receive and Transmit}{Device Types / Socket Device / Device Operation / Receive and Transmit}
> >
> >  The \field{guest_cid} configuration field MUST NOT contain a reserved CID as listed in \ref{sec:Device Types / Socket Device / Device configuration layout}.
> >
> > -A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> > +For stream sockets, A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> >  unknown \field{type} value.
> >
> >  \subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device / Device Operation / Stream Sockets}
> > @@ -240,6 +281,17 @@ \subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device / Device O
> >  destination) address tuple for a new connection while the other peer is still
> >  processing the old connection.
> >
> > +\subsubsection{Datagram Sockets}\label{sec:Device Types / Socket Device / Device Operation / Stream Sockets}
> > +
> > +Datagram (dgram) sockets are connectionless and unreliable. The sender just send
> > +a message to the peer and hope it will be delivered. If there is no socket binds the
> > +address on the other end, or the transmision or receving buffers are full, the packets
> > +will be droped. Each packet may have a source cid and
>
> s/dropped/
>
> Please stick to UDP semantics as much as possible so that applications
> can be ported easily and developers aren't surprised by unexpected
> behavior. UDP packets sent to a destination that has no listen socket
> result in a Connection Refused error. vsock dgrams should behave in the
> same way.

Sure. Will do.

> > +source port, the receiver can use them to send back a reply message.
>
> The VIRTIO specification avoids using the word "may" (as well as "must"
> and "should") outside the normative sections of the specification.
>
> It's unclear what this sentence means: can the driver set the source cid
> and source port to zero or an arbitary number if it does not need a
> reply? (I guess the answer is "no" but the sentence implies setting the
> source cid and source port is optional.)

Right, the driver should always set the source port correctly. Will fix it.

> > +
> > +Dgram sockets preserve the message boundaries. A message is either sent or dropped.
>
> What does "a message is either sent or dropped" mean? Does it mean
> messages are delivered once or not at all, but they are never
> duplicated?

Yes.

Thanks for all the comments. I will fix those spelling errors and run a
spell checker next time. I hope I answered all your questions. Please
let me know if I missed any questions or if anything is still unclear.

* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-18 17:59   ` [External] " Jiang Wang .
@ 2021-03-22 16:50     ` Stefan Hajnoczi
  2021-03-22 23:02       ` Jiang Wang .
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-03-22 16:50 UTC (permalink / raw)
  To: Jiang Wang .
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias



On Thu, Mar 18, 2021 at 10:59:20AM -0700, Jiang Wang . wrote:
> On Wed, Mar 17, 2021 at 8:45 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Mar 16, 2021 at 09:56:44PM +0000, jiang.wang wrote:
> > > Add supports for datagram type for virtio-vsock. Datagram
> > > sockets are connectionless and unreliable. To avoid contention
> > > with stream and other sockets, add two more virtqueues and
> > > a new feature bit to identify if those two new queues exist or not.
> > >
> > > Also add descriptions for resouce management of datagram, which
> > > does not use the existing credit update mechanism associated with
> > > stream sockets.
> > >
> > > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > > ---
> > >  virtio-vsock.tex | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--------
> > >  1 file changed, 62 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/virtio-vsock.tex b/virtio-vsock.tex
> > > index da7e641..a2ff0a4 100644
> > > --- a/virtio-vsock.tex
> > > +++ b/virtio-vsock.tex
> > > @@ -9,14 +9,27 @@ \subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
> > >
> > >  \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
> > >  \begin{description}
> > > +\item[0] stream rx
> > > +\item[1] stream tx
> > > +\item[2] dgram rx
> > > +\item[3] dgram tx
> > > +\item[4] event
> > > +\end{description}
> > > +The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_DRGAM is set. Otherwise, it
> > > +only uses 3 queues, as the following.
> > > +
> > > +\begin{description}
> > >  \item[0] rx
> > >  \item[1] tx
> > >  \item[2] event
> > >  \end{description}
> > >
> > > +
> > >  \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
> > >
> > > -There are currently no feature bits defined for this device.
> > > +\begin{description}
> > > +\item[VIRTIO_VSOCK_DGRAM (0)] Device has support for datagram sockets type.
> > > +\end{description}
> >
> > By convention feature bits are named VIRTIO_<device>_F_<feature>. Please
> > add the "_F_".
> >
> Sure.
> 
> > >
> > >  \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
> > >
> > > @@ -76,6 +89,7 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
> > >       le32 flags;
> > >       le32 buf_alloc;
> > >       le32 fwd_cnt;
> > > +     le64 timestamp;
> >
> > Adding this field breaks old devices and drivers. Please make this field
> > conditional on the dgram socket type or the VIRTIO_VSOCK_F_DGRAM feature
> > bit.
> >
> > Also, are all the other fields still used with dgram sockets? Maybe you
> > can use a union instead to reuse some space?
> 
> I will make this new field depending on the dgram socket type.
> 
> For the union idea, could I change the order of those fields? Dgram does not use
> flags and fwd_cnt fields, and I want to put them together with a union
> of a le64 timestamp.
> Something like:
> 
> struct virtio_vsock_hdr {
>   le64 src_cid;
>   le64 dst_cid;
>   le32 src_port;
>   le32 dst_port;
>   le32 len;
>   le16 type;
>   le16 op;
>   le32 buf_alloc;
>  union {
>         struct {
>               le32 flags;
>               le32 fwd_cnt;
>         } stream;
>         le64 dgram_timestamp;
>      } internal; // or a better name
> };

Unfortunately reordering the fields would break existing devices and
drivers since they access flags and fwd_cnt at a specific offset in
struct virtio_vsock_hdr.

Thinking more about the union idea, I think it's more trouble than it is
worth. You could just write dgram_timestamp accessor functions that
split/join the le64 value into the existing le32 flags and fwd_cnt
fields.
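
For example (a sketch only, using the kernel's le32 helpers; function
names are made up):

/* Pack/unpack a 64-bit dgram timestamp into the existing le32 flags
 * (low half) and fwd_cnt (high half) fields. */
static inline u64 vsock_hdr_dgram_timestamp(const struct virtio_vsock_hdr *hdr)
{
    return ((u64)le32_to_cpu(hdr->fwd_cnt) << 32) |
           le32_to_cpu(hdr->flags);
}

static inline void vsock_hdr_set_dgram_timestamp(struct virtio_vsock_hdr *hdr,
                                                 u64 ts)
{
    hdr->flags = cpu_to_le32((u32)ts);
    hdr->fwd_cnt = cpu_to_le32(ts >> 32);
}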

> 
> > >  };
> > >  \end{lstlisting}
> > >
> > > @@ -121,11 +135,14 @@ \subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device /
> > >  packets.  With additional resources, it becomes possible to process incoming
> > >  packets even when outgoing packets cannot be sent.
> > >
> > > -Eventually even the additional resources will be exhausted and further
> > > +For stream type, eventually even the additional resources will be exhausted and further
> > >  processing is not possible until the other side processes the virtqueue that
> > >  it has neglected.  This stop to processing prevents one side from causing
> > >  unbounded resource consumption in the other side.
> > >
> > > +For datagram type, the additional resources have a fixed size limit to prevent
> > > +unbounded resource consumption.
> > > +
> > >  \drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> > >
> > >  The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
> > > @@ -140,12 +157,15 @@ \subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Opera
> > >  consists of a (cid, port number) tuple. The header fields used for this are
> > >  \field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
> > >
> > > -Currently only stream sockets are supported. \field{type} is 1 for stream
> > > -socket types.
> > > +Currently stream and datagram (dgram) sockets are supported. \field{type} is 1 for stream
> > > +socket types. \field{type} is 3 for dgram socket types.
> > >
> > >  Stream sockets provide in-order, guaranteed, connection-oriented delivery
> > >  without message boundaries.
> > >
> > > +Datagram socekts provide connectionless unreliable messages of
> >
> > s/socekts/sockets/
> >
> > > +a fixed maximum length.
> > > +
> > >  \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management}
> > >  \field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
> > >  stream sockets. The guest and the device publish how much buffer space is
> > > @@ -170,20 +190,41 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
> > >  previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
> > >  communicating updates any time a change in buffer space occurs.
> > >
> > > +For datagram sockets, \field{buf_alloc} is also used on the rx side. Unlike stream
> > > +sockets, dgram sockets do not use VIRTIO_VSOCK_OP_CREDIT_UPDATE or
> > > +VIRTIO_VSOCK_OP_CREDIT_REQUEST packets. The dgram sender does not check if
> > > +the \field{buf_alloc} of the peer is full or not. If it is full, the packet
> > > +will be dropped. To limit resouce usage of the sender, \field{dgram_tx_bytes}
> >
> > s/resouce/resource/
> >
> > \field{dgram_tx_bytes} is not included in any struct definition?
> 
> dgram_tx_bytes is a field in struct vhost_vsock and virtio_vsock. But
> I don't see them
> mentioned in the spec.

The VIRTIO specification does not depend on the internals of device
implementations. Someone reading the spec must be able to implement a
driver or a device without knowledge of the Linux virtio_vsock or
vhost_vsock implementations.

If dgram_tx_bytes is a new concept that device implementations must
incorporate, then please describe it fully in the spec.

> 
> > > +is used for each VM and host. Only payload bytes are counted and header bytes are not
> >
> > Please use the VIRTIO specification terms "driver" and "device" instead
> > of VM and host.
> Sure.
> 
> > It's not clear to me what "used" means. What exactly do the driver and
> > device need to do?
> 
> In the  Linux and KVM case, I added the dgram_tx_bytes to vhost_vsock
> and virtio_vsock.
> Then in virtio_transport_send_pkt() and vhost_transport_send_pkt(),
> the code will increase and check
> dgram_tx_bytes first. If dgram_tx_bytes is no less than
> VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE,
> the function will free the pkt with virtio_transport_free_pkt() and
> then return. The dgram_tx_bytes
> is decreased in vhost_transport_do_send_pkt, just before each
> virtio_transport_free_pkt() is called.
> It is similar for the device case.
> 
> The goal is to limit the memory usage for the vsock. Since dgram does
> not use credit, the sender
> can send far more packets than the receiver can handle. If we don't
> add the above check, the
> sender can use lots of memory and cause OOM in the Linux kernel.
> 
> I am not sure if these details are too much for a spec or not. But I
> think it will be good to
> add some description. Or we can just say that the sender ( device or
> the driver ) must not use
> unlimited resources.

I'm not sure why this mechanism is needed since the virtqueue has a
fixed size and the receiver has socket buffers (rcvbuf) of fixed sizes.
Memory usage is bounded so I don't understand how OOM can occur.

I would have expected the following:
1. When sending, drop VIRTIO_VSOCK_OP_RW packets instead of sending if
   the virtqueue is full.
2. When receiving, drop VIRTIO_VSOCK_OP_RW packets if the socket rcvbuf
   is full.

That's all. No additional accounting mechanism is necessary for
unreliable delivery.
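
In other words (hypothetical helpers, sketch only):

/* 1. Sender side: drop rather than queue when the virtqueue is full. */
void dgram_tx(struct virtqueue *tx_vq, struct pkt *pkt)
{
    if (virtqueue_full(tx_vq))
        drop_pkt(pkt);
    else
        virtqueue_add(tx_vq, pkt);
}

/* 2. Receiver side: drop when the socket rcvbuf is full. */
void dgram_rx(struct sock *sk, struct pkt *pkt)
{
    if (rcvbuf_full(sk))
        drop_pkt(pkt);
    else
        enqueue_rcvbuf(sk, pkt);
}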

> > > +included. If \field{dgram_tx_bytes} is equal to VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE,
> > > +the send will fail.
> >
> > Does the device reject dgram packets or does the driver refuse to send
> > more dgram packets?
> 
> The sender will drop the packets. This one is independent of the rx
> side. If the device is the
> sender, device will drop the packets. If the driver is the sender, the
> driver will drop the packets.
> 
> > > +
> > > +For dgram, since \field{dgram_tx_byte} is shared within a VM or host, a tx_timeout and
> > > +a timer are used for each outgoing packet. If a packet is not delivered within
> > > +tx_timeout, it will be dropped to make free space for other dgram sockets.
> >
> > What does this mean? Does the timeout cause the device to drop packets
> > that haven't been handled yet (e.g. read by a host userspace process)?
> 
> This is on the sender side (regardless of the driver or the device ).
> This  is similar to
> above mentioned dgram_tx_bytes. Dgram_tx_bytes only makes sure if the memory
> usage reaches a limit, the sender will not consume more memory. But
> dgram_tx_bytes
> is shared among different dgram sockets for a device or the driver, a
> misbehave or
> malicious dgram socket can potentially use all memory and block other
> dgram sockets.
> from sending any new packets. To handle this case,
> the tx_timeout is used to free some space from the sender's memory pool, so that
> other dgram sockets can continue to send some packets.

Why is the timestamp included in the header? It seems to only be used by
the side that is sending so it's not obvious to me that it needs to be
sent to the other side.

How does the timestamp ensure fairness? It seems like a malicious sender
could simply send more packets in order to exhaust all available memory
again when its old packets expire.

> 
> > POSIX Sockets have a receive socket buffer (rcvbuf) that is used for
> > memory accounting and dropping packets. Operating systems implementing
> > POSIX Sockets would use that to decide when incoming packets are
> > dropped. It seems like dgram_tx_byte does something similar at the
> > device level and I'm not sure why it's necessary?
> 
> dgram_tx_byte is for the tx (sender) side. The receive buffer you mentioned
> is on the receiving side.

POSIX Sockets have a send socket buffer (sndbuf) for that. The sender
should use a similar approach to how UDP sockets work.
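
That is, per-socket send buffer accounting rather than one global
counter, e.g. (a sketch with invented field names, not the real struct
sock layout):

/* Charge each socket only for its own in-flight bytes so a busy or
 * malicious socket cannot starve the others. */
int dgram_charge_sndbuf(struct sock *sk, u32 len)
{
    if (sk->wmem_used + len > sk->sndbuf)
        return -EAGAIN;       /* caller drops the packet */
    sk->wmem_used += len;     /* uncharged on tx completion */
    return 0;
}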

> Thanks for all the comments. I will fix those spelling errors and use
> a spell check
> next time. I hope I answered all your questions. Please let me know if I missed
> any questions or anything still not clear.

To summarize:

1. Why is dgram_tx_bytes needed since the virtqueue already has a finite
   size?

2. Does the timestamp field provide fairness between senders? I don't
   understand its purpose.

3. vsock dgram should probably follow the UDP approach to memory
   accounting instead of inventing new mechanisms. Applications and
   developers will probably be happiest with UDP-like behavior.

Stefan


* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-22 16:50     ` Stefan Hajnoczi
@ 2021-03-22 23:02       ` Jiang Wang .
  2021-03-22 23:10         ` Michael S. Tsirkin
  0 siblings, 1 reply; 19+ messages in thread
From: Jiang Wang . @ 2021-03-22 23:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias

On Mon, Mar 22, 2021 at 9:51 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Mar 18, 2021 at 10:59:20AM -0700, Jiang Wang . wrote:
> > On Wed, Mar 17, 2021 at 8:45 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Tue, Mar 16, 2021 at 09:56:44PM +0000, jiang.wang wrote:
> > > > Add supports for datagram type for virtio-vsock. Datagram
> > > > sockets are connectionless and unreliable. To avoid contention
> > > > with stream and other sockets, add two more virtqueues and
> > > > a new feature bit to identify if those two new queues exist or not.
> > > >
> > > > Also add descriptions for resouce management of datagram, which
> > > > does not use the existing credit update mechanism associated with
> > > > stream sockets.
> > > >
> > > > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
> > > > ---
> > > >  virtio-vsock.tex | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--------
> > > >  1 file changed, 62 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/virtio-vsock.tex b/virtio-vsock.tex
> > > > index da7e641..a2ff0a4 100644
> > > > --- a/virtio-vsock.tex
> > > > +++ b/virtio-vsock.tex
> > > > @@ -9,14 +9,27 @@ \subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
> > > >
> > > >  \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
> > > >  \begin{description}
> > > > +\item[0] stream rx
> > > > +\item[1] stream tx
> > > > +\item[2] dgram rx
> > > > +\item[3] dgram tx
> > > > +\item[4] event
> > > > +\end{description}
> > > > +The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_DRGAM is set. Otherwise, it
> > > > +only uses 3 queues, as the following.
> > > > +
> > > > +\begin{description}
> > > >  \item[0] rx
> > > >  \item[1] tx
> > > >  \item[2] event
> > > >  \end{description}
> > > >
> > > > +
> > > >  \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
> > > >
> > > > -There are currently no feature bits defined for this device.
> > > > +\begin{description}
> > > > +\item[VIRTIO_VSOCK_DGRAM (0)] Device has support for datagram sockets type.
> > > > +\end{description}
> > >
> > > By convention feature bits are named VIRTIO_<device>_F_<feature>. Please
> > > add the "_F_".
> > >
> > Sure.
> >
> > > >
> > > >  \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
> > > >
> > > > @@ -76,6 +89,7 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
> > > >       le32 flags;
> > > >       le32 buf_alloc;
> > > >       le32 fwd_cnt;
> > > > +     le64 timestamp;
> > >
> > > Adding this field breaks old devices and drivers. Please make this field
> > > conditional on the dgram socket type or the VIRTIO_VSOCK_F_DGRAM feature
> > > bit.
> > >
> > > Also, are all the other fields still used with dgram sockets? Maybe you
> > > can use a union instead to reuse some space?
> >
> > I will make this new field depending on the dgram socket type.
> >
> > For the union idea, could I change the order of those fields? Dgram does not use
> > flags and fwd_cnt fields, and I want to put them together with a union
> > of a le64 timestamp.
> > Something like:
> >
> > struct virtio_vsock_hdr {
> >   le64 src_cid;
> >   le64 dst_cid;
> >   le32 src_port;
> >   le32 dst_port;
> >   le32 len;
> >   le16 type;
> >   le16 op;
> >   le32 buf_alloc;
> >  union {
> >         struct {
> >               le32 flags;
> >               le32 fwd_cnt;
> >         } stream;
> >         le64 dgram_timestamp;
> >      } internal; // or a better name
> > };
>
> Unfortunately reordering the fields would break existing devices and
> drivers since they access flags and fwd_cnt at a specific offset in
> struct virtio_vsock_hdr.
>
> Thinking more about the union idea, I think it's more trouble than it is
> worth. You could just write dgram_timestamp accessor functions that
> split/join the le64 value into the existing le32 flags and fwd_cnt
> fields.

Sure. Maybe we don't need this field at all. Will discuss it at the
end of the email.

> >
> > > >  };
> > > >  \end{lstlisting}
> > > >
> > > > @@ -121,11 +135,14 @@ \subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device /
> > > >  packets.  With additional resources, it becomes possible to process incoming
> > > >  packets even when outgoing packets cannot be sent.
> > > >
> > > > -Eventually even the additional resources will be exhausted and further
> > > > +For stream type, eventually even the additional resources will be exhausted and further
> > > >  processing is not possible until the other side processes the virtqueue that
> > > >  it has neglected.  This stop to processing prevents one side from causing
> > > >  unbounded resource consumption in the other side.
> > > >
> > > > +For datagram type, the additional resources have a fixed size limit to prevent
> > > > +unbounded resource consumption.
> > > > +
> > > >  \drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> > > >
> > > >  The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
> > > > @@ -140,12 +157,15 @@ \subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Opera
> > > >  consists of a (cid, port number) tuple. The header fields used for this are
> > > >  \field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
> > > >
> > > > -Currently only stream sockets are supported. \field{type} is 1 for stream
> > > > -socket types.
> > > > +Currently stream and datagram (dgram) sockets are supported. \field{type} is 1 for stream
> > > > +socket types. \field{type} is 3 for dgram socket types.
> > > >
> > > >  Stream sockets provide in-order, guaranteed, connection-oriented delivery
> > > >  without message boundaries.
> > > >
> > > > +Datagram socekts provide connectionless unreliable messages of
> > >
> > > s/socekts/sockets/
> > >
> > > > +a fixed maximum length.
> > > > +
> > > >  \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management}
> > > >  \field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
> > > >  stream sockets. The guest and the device publish how much buffer space is
> > > > @@ -170,20 +190,41 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
> > > >  previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
> > > >  communicating updates any time a change in buffer space occurs.
> > > >
> > > > +For datagram sockets, \field{buf_alloc} is also used on the rx side. Unlike stream
> > > > +sockets, dgram sockets do not use VIRTIO_VSOCK_OP_CREDIT_UPDATE or
> > > > +VIRTIO_VSOCK_OP_CREDIT_REQUEST packets. The dgram sender does not check if
> > > > +the \field{buf_alloc} of the peer is full or not. If it is full, the packet
> > > > +will be dropped. To limit resouce usage of the sender, \field{dgram_tx_bytes}
> > >
> > > s/resouce/resource/
> > >
> > > \field{dgram_tx_bytes} is not included in any struct definition?
> >
> > dgram_tx_bytes is a field in struct vhost_vsock and struct virtio_vsock,
> > but I don't see it mentioned in the spec.
>
> The VIRTIO specification does not depend on the internals of device
> implementations. Someone reading the spec must be able to implement a
> driver or a device without knowledge of the Linux virtio_vsock or
> vhost_vsock implementations.
>
> If dgram_tx_bytes is a new concept that device implementations must
> incorporate, then please describe it fully in the spec.
>

Sure. Will likely drop this field. Will discuss more at the end of the email.

> >
> > > > +is used for each VM and host. Only payload bytes are counted and header bytes are not
> > >
> > > Please use the VIRTIO specification terms "driver" and "device" instead
> > > of VM and host.
> > Sure.
> >
> > > It's not clear to me what "used" means. What exactly do the driver and
> > > device need to do?
> >
> > In the Linux and KVM case, I added dgram_tx_bytes to vhost_vsock and
> > virtio_vsock. Then in virtio_transport_send_pkt() and
> > vhost_transport_send_pkt(), the code increases and checks
> > dgram_tx_bytes first. If dgram_tx_bytes is no less than
> > VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE, the function frees the pkt with
> > virtio_transport_free_pkt() and then returns. dgram_tx_bytes is
> > decreased in vhost_transport_do_send_pkt(), just before each
> > virtio_transport_free_pkt() call. It is similar for the device case.
> >
> > The goal is to limit the memory usage for vsock. Since dgram does not
> > use credit, the sender can send far more packets than the receiver can
> > handle. If we don't add the above check, the sender can use lots of
> > memory and cause OOM in the Linux kernel.
> >
> > I am not sure whether these details are too much for a spec, but I
> > think it would be good to add some description. Or we can just say
> > that the sender (device or driver) must not use unlimited resources.
>
> I'm not sure why this mechanism is needed since the virtqueue has a
> fixed size and the receiver has socket buffers (rcvbuf) of fixed sizes.
> Memory usage is bounded so I don't understand how OOM can occur.
>
> I would have expected the following:
> 1. When sending, drop VIRTIO_VSOCK_OP_RW packets instead of sending if
>    the virtqueue is full.
> 2. When receiving, drop VIRTIO_VSOCK_OP_RW packets if the socket rcvbuf
>    is full.
>
> That's all. No additional accounting mechanism is necessary for
> unreliable delivery.
>

You are right. I checked again and the OOM problem is likely a problem in
my implementation. I will drop this additional accounting.

Following are more details about my implementation problem. I reused the
stream send_pkt functions for datagram too. When a virtqueue is full, the
code does not drop dgram packets, but adds them back to the send_pkt_list
(because the stream type does that). I think I should fix that part to
drop the dgram packets instead of adding a new accounting mechanism.
Will give it a try.
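For illustration, the drop-on-full behavior Stefan describes could look
like the sketch below; all names are made up. Memory stays bounded with
no extra accounting: tx by the virtqueue size, rx by the per-socket
receive buffer.

/* Rule 1: sender side. */
static void dgram_tx(struct dgram_txq *txq, struct dgram_pkt *pkt)
{
	if (!try_add_to_virtqueue(txq, pkt))
		dgram_pkt_free(pkt);		/* ring full: drop, no error */
}

/* Rule 2: receiver side. */
static void dgram_rx(struct dgram_sock *dsk, struct dgram_pkt *pkt)
{
	if (dsk->rcvbuf_used + pkt->len > dsk->rcvbuf)
		dgram_pkt_free(pkt);		/* rcvbuf full: drop */
	else
		dgram_sock_enqueue(dsk, pkt);	/* charges pkt->len to rcvbuf_used */
}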

> > > > +included. If \field{dgram_tx_bytes} is equal to VIRTIO_VSOCK_MAX_DGRAM_BUF_SIZE,
> > > > +the send will fail.
> > >
> > > Does the device reject dgram packets or does the driver refuse to send
> > > more dgram packets?
> >
> > The sender will drop the packets. This is independent of the rx side.
> > If the device is the sender, the device will drop the packets. If the
> > driver is the sender, the driver will drop the packets.
> >
> > > > +
> > > > +For dgram, since \field{dgram_tx_byte} is shared within a VM or host, a tx_timeout and
> > > > +a timer are used for each outgoing packet. If a packet is not delivered within
> > > > +tx_timeout, it will be dropped to make free space for other dgram sockets.
> > >
> > > What does this mean? Does the timeout cause the device to drop packets
> > > that haven't been handled yet (e.g. read by a host userspace process)?
> >
> > This is on the sender side (regardless of whether it is the driver or
> > the device). It is similar to the above mentioned dgram_tx_bytes.
> > dgram_tx_bytes only makes sure that once memory usage reaches a limit,
> > the sender will not consume more memory. But since dgram_tx_bytes is
> > shared among different dgram sockets for a device or the driver, a
> > misbehaving or malicious dgram socket can potentially use all the
> > memory and block other dgram sockets from sending any new packets. To
> > handle this case, the tx_timeout is used to free some space from the
> > sender's memory pool, so that other dgram sockets can continue to send
> > packets.
>
> Why is the timestamp included in the header? It seems to only be used by
> the side that is sending so it's not obvious to me that it needs to be
> sent to the other side.
>
> How does the timestamp ensure fairness? It seems like a malicious sender
> could simply send more packets in order to exhaust all available memory
> again when its old packets expire.
>
Since I will drop the additional accounting, I will probably drop this
timestamp too. Will describe more at the end of the email.

> >
> > > POSIX Sockets have a receive socket buffer (rcvbuf) that is used for
> > > memory accounting and dropping packets. Operating systems implementing
> > > POSIX Sockets would use that to decide when incoming packets are
> > > dropped. It seems like dgram_tx_byte does something similar at the
> > > device level and I'm not sure why it's necessary?
> >
> > dgram_tx_byte is for the tx (sender) side. The receive buffer you mentioned
> > is on the receiving side.
>
> POSIX Sockets have a send socket buffer (sndbuf) for that. The sender
> should use a similar approach to how UDP sockets work.
>
> > Thanks for all the comments. I will fix those spelling errors and use a
> > spell checker next time. I hope I answered all your questions. Please
> > let me know if I missed any questions or if anything is still not clear.
>
> To summarize:
>
> 1. Why is dgram_tx_bytes needed since the virtqueue already has a finite
>    size?

The problem is in my implementation as mentioned above. Will drop this.

>
> 2. Does the timestamp field provide fairness between senders? I don't
>    understand its purpose.
>

The timestamp is associated with my additional accounting. I will drop
this field.

> 3. vsock dgram should probably follow the UDP approach to memory
>    accounting instead of inventing new mechanisms. Applications and
>    developers will probably be happiest with UDP-like behavior.
>

Sure. I checked UDP briefly. I see SO_SNDBUF, which is per socket.
Also, I found net.ipv4.udp_mem, which is a global value.

After dropping my additional accounting, I think there is still a question
of whether we want to protect the shared dgram virtqueue against bad dgram
sockets, and if so, how to do it and what to write in the spec. For
example, if a bad dgram socket keeps sending lots of data to a port that
no socket is receiving on, those packets will only be dropped by the
receiver (device or driver), or when the virtqueue is full. Other good
dgram sockets will compete with this bad one on the tx side. In my current
implementation, it depends on how the Linux scheduler schedules those
processes. The bad one is unlikely to keep the virtqueue full all the time
and completely block other good dgram sockets, because the other end is
still receiving and draining the virtqueue, but it will waste a lot of
resources. I think that is fine and we don't need to add strict
requirements about it in the spec.

I don't know whether UDP has a situation similar to the shared virtqueue.
net.ipv4.udp_mem looks like just global accounting. If you have any
suggestions about this, please let me know.

Thank you!

> Stefan

* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-22 23:02       ` Jiang Wang .
@ 2021-03-22 23:10         ` Michael S. Tsirkin
  2021-03-23  2:23           ` Jiang Wang .
  0 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2021-03-22 23:10 UTC (permalink / raw)
  To: Jiang Wang .
  Cc: cong.wang, Xiongchun Duan, cohuck, virtualization, xieyongji,
	Stefan Hajnoczi, Arseny Krasnov, asias

On Mon, Mar 22, 2021 at 04:02:14PM -0700, Jiang Wang . wrote:
> After dropping my additional accounting. I think there is still a question
> about if we want to protect the shared dgram virtqueue
> against bad dgram sockets or not. And if so, how to do it, or what to write
> in the spec. For example, if a bad dgram socket keeps sending lots of
> data to a port that no socket is receiving,
> then those packets will only be dropped by the receiver (device or the
> driver), or
> when the virtqueue is full. Other good dgram sockets will compete
> with this bad one on the tx side. In my current implementation, it
> depends on how the Linux scheduler schedules those processes.
> The bad one is unlikely to make the virtqueue full all the time and
> completely block
> other good dgram sockets because the other end is still receiving and
> cleaning the virtqueue. But it will waste a lot of resources. I think
> that is fine and we don't need to add strict requirements about it
> in the spec.
> 
> I don't know if UDP has a similar situation as shared virtqueue or not. The
> net.ipv4.udp_mem looks like just a global accounting. If you have any
> suggestions about this, please let me know.
> 
> Thank you!

Yes, I suspect just not doing any accounting isn't going to work well.
Consider that with a NIC, if a socket is sending data faster than the
destination can consume it, its packets get dropped. So far so good.

With vsock, if your guest transmits too fast, packets get dropped, and
dropping them is faster than processing them on the host. The result is
that the guest goes even faster!

This used to be a problem on Linux too, before packets inside transmit
queues started being accounted for: a socket would fill the tx queue and
then others would just drop their packets.

So we need some kind of accounting to slow down transmitters when they
go too fast, to avoid this kind of positive feedback.


-- 
MST


* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-22 23:10         ` Michael S. Tsirkin
@ 2021-03-23  2:23           ` Jiang Wang .
  2021-03-23  8:53             ` Stefan Hajnoczi
  0 siblings, 1 reply; 19+ messages in thread
From: Jiang Wang . @ 2021-03-23  2:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: cong.wang, Xiongchun Duan, cohuck, virtualization, xieyongji,
	Stefan Hajnoczi, Arseny Krasnov, asias

Got it. Will do.

On Mon, Mar 22, 2021 at 4:10 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Mar 22, 2021 at 04:02:14PM -0700, Jiang Wang . wrote:
> > After dropping my additional accounting. I think there is still a question
> > about if we want to protect the shared dgram virtqueue
> > against bad dgram sockets or not. And if so, how to do it, or what to write
> > in the spec. For example, if a bad dgram socket keeps sending lots of
> > data to a port that no socket is receiving,
> > then those packets will only be dropped by the receiver (device or the
> > driver), or
> > when the virtqueue is full. Other good dgram sockets will compete
> > with this bad one on the tx side. In my current implementation, it
> > depends on how the Linux scheduler schedules those processes.
> > The bad one is unlikely to make the virtqueue full all the time and
> > completely block
> > other good dgram sockets because the other end is still receiving and
> > cleaning the virtqueue. But it will waste a lot of resources. I think
> > that is fine and we don't need to add strict requirements about it
> > in the spec.
> >
> > I don't know if UDP has a similar situation as shared virtqueue or not. The
> > net.ipv4.udp_mem looks like just a global accounting. If you have any
> > suggestions about this, please let me know.
> >
> > Thank you!
>
> Yes I suspect just not doing any accounting isn't going to work well.
> Consider that with a NIC, if a socket is sending too much data faster
> than destination can consume it, its packets get dropped. So far so
> good.
>
> With vsock, if your guest gets too fast, packets are being dropped
> which is faster than processing them on the host.
> The result is guest goes even faster!
>
> Used to be a problem on linux too before packets inside transmit
> queues started being accounted for: a socket would fill
> the tx queue then others would just drop their packets.
>
> So we need some kind of accounting to slow down transmitters when
> they go too fast, to avoid this kind of positive feedback.
>
>
> --
> MST
>

* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-23  2:23           ` Jiang Wang .
@ 2021-03-23  8:53             ` Stefan Hajnoczi
  2021-03-26 23:40               ` Jiang Wang .
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-03-23  8:53 UTC (permalink / raw)
  To: Jiang Wang .
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias


On Mon, Mar 22, 2021 at 07:23:14PM -0700, Jiang Wang . wrote:
> Got it. Will do.

You could look at udp_sendmsg() to see how sockets compete when
transmitting to the same net device.

I'm not very familiar with this but I guess that the qdisc (like
fq_codel) decides which packets to place into the device's tx queue. I
guess sk_buffs waiting to be put onto the device's tx queue are
accounted for against the socket's sndbuf. Further sendmsg calls will
fail with -ENOBUFS when the sndbuf limit is reached.

It's not clear to me how much of the existing code can be reused since
vsock does not use sk_buff or netdev :(.

Stefan


* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-23  8:53             ` Stefan Hajnoczi
@ 2021-03-26 23:40               ` Jiang Wang .
  2021-03-29  9:25                 ` Stefan Hajnoczi
  2021-03-30 15:32                 ` Stefano Garzarella
  0 siblings, 2 replies; 19+ messages in thread
From: Jiang Wang . @ 2021-03-26 23:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias

Hi Michael and Stefan,

I thought about this and discussed it with my colleague Cong Wang.
One idea is to make the current asynchronous send_pkt flow synchronous,
so that if the virtqueue is full, the function can return ENOMEM all the
way back to the caller, and the caller can check the return value of
sendmsg and slow down when necessary.

In the spec, we can put something like: if the virtqueue is full, the
caller should be notified with an error.

In terms of implementation, that means we will remove the current
send_pkt_work for both stream and dgram sockets. Currently, the
code path uses RCU and a work queue, then grabs a mutex in the
work queue function. Since we cannot grab a mutex inside an RCU
critical section, we would have to change RCU to a normal reference
counting mechanism. I think this is doable. The drawback is that
reference counting in general spends a few more cycles than RCU,
so there is a small price to pay. Another option is to use
sleepable RCU and remove the work queue.
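For illustration, the synchronous path could look roughly like the
sketch below; add_pkt_to_virtqueue() is a made-up helper wrapping
virtqueue_add_sgs(), and VSOCK_VQ_DGRAM_TX and dgram_tx_lock are
hypothetical names:

static int dgram_send_pkt_sync(struct virtio_vsock *vsock,
			       struct virtio_vsock_pkt *pkt)
{
	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_DGRAM_TX];
	int ret;

	mutex_lock(&vsock->dgram_tx_lock);	/* needs a sleepable context */
	ret = add_pkt_to_virtqueue(vq, pkt);	/* fails when the ring is full */
	if (ret == 0)
		virtqueue_kick(vq);
	mutex_unlock(&vsock->dgram_tx_lock);

	if (ret < 0) {
		virtio_transport_free_pkt(pkt);
		return -ENOMEM;		/* propagated back to sendmsg() */
	}
	return 0;
}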

What do you guys think?

btw, I will also add some SENDBUF restrictions for the dgram
sockets, but I don't think it needs to be in the spec.

Regards,

Jiang

On Tue, Mar 23, 2021 at 1:53 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Mar 22, 2021 at 07:23:14PM -0700, Jiang Wang . wrote:
> > Got it. Will do.
>
> You could look at udp_sendmsg() to see how sockets compete when
> transmitting to the same net device.
>
> I'm not very familiar with this but I guess that the qdisc (like
> fq_codel) decides which packets to place into the device's tx queue. I
> guess sk_buffs waiting to be put onto the device's tx queue are
> accounted for against the socket's sndbuf. Further sendmsg calls will
> fail with -ENOBUFS when the sndbuf limit is reached.
>
> It's not clear to me how much of the existing code can be reused since
> vsock does not use sk_buff or netdev :(.
>
> Stefan

* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-26 23:40               ` Jiang Wang .
@ 2021-03-29  9:25                 ` Stefan Hajnoczi
  2021-03-29 23:22                   ` Jiang Wang .
  2021-03-30 15:32                 ` Stefano Garzarella
  1 sibling, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-03-29  9:25 UTC (permalink / raw)
  To: Jiang Wang .
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias


On Fri, Mar 26, 2021 at 04:40:09PM -0700, Jiang Wang . wrote:
> I thought about this and discussed it with my colleague Cong Wang.
> One idea is to make current asynchronous send_pkt flow to be synchronous,
> then if the virtqueue is full, the function can return  ENOMEM all the way back
> to the caller and the caller can check the return value of sendmsg
> and slow down when necessary.
> 
> In the spec, we can put something like, if the virtqueue is full, the caller
> should be notified with an error etc.
> 
> In terms of implementation, that means we will remove the current
> send_pkt_work for both stream and dgram sockets. Currently, the
> code path uses RCU and a work queue, then grab a mutex in the
> work queue function. Since we cannot grab mutex when in rcu
> critical section, we have to change RCU to a normal reference
> counting mechanism. I think this is doable. The drawback is
> that the reference counting in general spends a little more
> cycles than the RCU, so there is a small price to pay. Another
> option is to use Sleepable RCU and remove the work queue.
> 
> What do you guys think?

I think the tx code path is like this because of reliable delivery.
Maybe a separate datagram rx/tx code path would be simpler?

Take the datagram tx virtqueue lock, try to add the packet into the
virtqueue, and return -ENOBUFS if the virtqueue is full. Then use the
datagram socket's sndbuf accounting to prevent queuing up too many
packets. When a datagram tx virtqueue buffer is used by the device,
select queued packets for transmission. Unlike the stream tx/rx code
path there is no dependency between tx and rx because we don't have the
credit mechanism.

> btw, I will also add some SENDBUF restrictions for the dgram
> sockets, but I don't think it needs to be in the spec.

Yes, the spec doesn't need to explain implementation-specific issues.

If there are common implementation issues then the spec can explain them
in general terms (not referring to Linux internals) to help
implementors.

Stefan


* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-29  9:25                 ` Stefan Hajnoczi
@ 2021-03-29 23:22                   ` Jiang Wang .
  2021-03-30 10:42                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 19+ messages in thread
From: Jiang Wang . @ 2021-03-29 23:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias

On Mon, Mar 29, 2021 at 2:26 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, Mar 26, 2021 at 04:40:09PM -0700, Jiang Wang . wrote:
> > I thought about this and discussed it with my colleague Cong Wang.
> > One idea is to make current asynchronous send_pkt flow to be synchronous,
> > then if the virtqueue is full, the function can return  ENOMEM all the way back
> > to the caller and the caller can check the return value of sendmsg
> > and slow down when necessary.
> >
> > In the spec, we can put something like, if the virtqueue is full, the caller
> > should be notified with an error etc.
> >
> > In terms of implementation, that means we will remove the current
> > send_pkt_work for both stream and dgram sockets. Currently, the
> > code path uses RCU and a work queue, then grab a mutex in the
> > work queue function. Since we cannot grab mutex when in rcu
> > critical section, we have to change RCU to a normal reference
> > counting mechanism. I think this is doable. The drawback is
> > that the reference counting in general spends a little more
> > cycles than the RCU, so there is a small price to pay. Another
> > option is to use Sleepable RCU and remove the work queue.
> >
> > What do you guys think?
>
> I think the tx code path is like this because of reliable delivery.
> Maybe a separate datagram rx/tx code path would be simpler?

I thought about this too. dgram can have a separate rx/tx path from the
stream types. In this case, the_virtio_vsock will still be shared because
the virtqueues have to be in one structure. Then the_virtio_vsock will be
protected by RCU plus a reference count (or by a sleepable RCU).

In vhost_vsock_dev_release, it will wait for both the RCU grace period
and the reference count to be released, and then free the memory. I think
this is doable. Let me know if there is a better way to do it.
Btw, I think the dgram tx code path will be quite different from stream,
but the dgram rx path will be similar to the stream type.
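For illustration, the sleepable-RCU option might look like the sketch
below; vsock_srcu and the_virtio_vsock_dgram are made-up names, and
dgram_send_pkt_sync() stands for the hypothetical synchronous sender
discussed earlier:

DEFINE_STATIC_SRCU(vsock_srcu);

static struct virtio_vsock __rcu *the_virtio_vsock_dgram;

static int dgram_send_pkt(struct virtio_vsock_pkt *pkt)
{
	struct virtio_vsock *vsock;
	int idx, ret = -ENODEV;

	/* SRCU read side: unlike plain RCU, we may sleep here, e.g. to
	 * take the tx mutex inside dgram_send_pkt_sync(). */
	idx = srcu_read_lock(&vsock_srcu);
	vsock = srcu_dereference(the_virtio_vsock_dgram, &vsock_srcu);
	if (vsock)
		ret = dgram_send_pkt_sync(vsock, pkt);
	srcu_read_unlock(&vsock_srcu, idx);
	return ret;
}

static void dgram_release(void)
{
	RCU_INIT_POINTER(the_virtio_vsock_dgram, NULL);
	synchronize_srcu(&vsock_srcu);	/* wait for in-flight senders */
	/* now it is safe to free the shared state */
}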

> Take the datagram tx virtqueue lock, try to add the packet into the
> virtqueue, and return -ENOBUFS if the virtqueue is full. Then use the
> datagram socket's sndbuf accounting to prevent queuing up too many
> packets. When a datagram tx virtqueue buffer is used by the device,
> select queued packets for transmission.

I am not sure about the last sentence. In the new design, there will be
no other queued packets for dgram (except those in the virtqueue). When
the dgram tx virtqueue is drained by the device, the driver side just
needs to free some space. There is no need to trigger more transmission.


> Unlike the stream tx/rx code
> path there is no dependency between tx and rx because we don't have the
> credit mechanism.
> > btw, I will also add some SENDBUF restrictions for the dgram
> > sockets, but I don't think it needs to be in the spec.
>
> Yes, the spec doesn't need to explain implementation-specific issues.
>
> If there are common implementation issues then the spec can explain them
> in general terms (not referring to Linux internals) to help
> implementors.

Got it. Thanks. I will send a v2 spec soon.

> Stefan

* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-29 23:22                   ` Jiang Wang .
@ 2021-03-30 10:42                     ` Stefan Hajnoczi
  2021-03-30 18:30                       ` Jiang Wang .
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-03-30 10:42 UTC (permalink / raw)
  To: Jiang Wang .
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias


On Mon, Mar 29, 2021 at 04:22:28PM -0700, Jiang Wang . wrote:
> On Mon, Mar 29, 2021 at 2:26 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Fri, Mar 26, 2021 at 04:40:09PM -0700, Jiang Wang . wrote:
> > > I thought about this and discussed it with my colleague Cong Wang.
> > > One idea is to make current asynchronous send_pkt flow to be synchronous,
> > > then if the virtqueue is full, the function can return  ENOMEM all the way back
> > > to the caller and the caller can check the return value of sendmsg
> > > and slow down when necessary.
> > >
> > > In the spec, we can put something like, if the virtqueue is full, the caller
> > > should be notified with an error etc.
> > >
> > > In terms of implementation, that means we will remove the current
> > > send_pkt_work for both stream and dgram sockets. Currently, the
> > > code path uses RCU and a work queue, then grab a mutex in the
> > > work queue function. Since we cannot grab mutex when in rcu
> > > critical section, we have to change RCU to a normal reference
> > > counting mechanism. I think this is doable. The drawback is
> > > that the reference counting in general spends a little more
> > > cycles than the RCU, so there is a small price to pay. Another
> > > option is to use Sleepable RCU and remove the work queue.
> > >
> > > What do you guys think?
> >
> > I think the tx code path is like this because of reliable delivery.
> > Maybe a separate datagram rx/tx code path would be simpler?
> 
> I thought about this too.  dgram can have a separate rx/tx
> path from stream types. In this case, the the_virtio_vsock
> will still be shared because the virtqueues have to be in one
> structure. Then the_virtio_vsock will be protected by a rcu
> and a reference counting ( or a sleepable RCU ).
> 
> In vhost_vsock_dev_release, it will wait for both rcu and another
> one to be finished and then free the memory. I think this is
> doable. Let me know if there is a better way to do it.
> Btw, I think dgram tx code path will be quite different from
> stream, but dgram rx path will be similar to stream type.
> 
> > Take the datagram tx virtqueue lock, try to add the packet into the
> > virtqueue, and return -ENOBUFS if the virtqueue is full. Then use the
> > datagram socket's sndbuf accounting to prevent queuing up too many
> > packets. When a datagram tx virtqueue buffer is used by the device,
> > select queued packets for transmission.
> 
> I am not sure about the last sentence. In the new design, there will
> be no other queued packets for dgram (except those in the virtqueue).
> When dgram tx virtqueue is freed by the device, the driver side
> just needs to free some space. No need
> to trigger more transmission.

This approach drops packets when the virtqueue is full. In order to get
close to UDP semantics I think the following is necessary:

1. Packets are placed directly into the tx virtqueue when possible.
2. Packets are queued up to sndbuf bytes if the tx virtqueue is full.
3. When a full tx virtqueue gets some free space again there is an
   algorithm that selects from among the queued sockets.

This is how unreliable delivery can be somewhat fair between competing
threads. I thought you were trying to achieve this (it's similar to what
UDP does)?
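For illustration, a minimal sketch of these three steps, with a simple
FIFO pick for step 3; every type, field, and helper name below is made up:

struct dgram_txq {
	spinlock_t lock;
	struct list_head waiting;	/* sockets with queued packets, FIFO */
};

/* Steps 1 and 2, called from sendmsg(). */
static int dgram_send(struct dgram_txq *txq, struct dgram_sock *dsk,
		      struct dgram_pkt *pkt)
{
	int ret = 0;

	spin_lock(&txq->lock);
	if (try_add_to_virtqueue(txq, pkt)) {
		/* step 1: the ring had room, the packet goes straight out */
	} else if (dsk->queued_bytes + pkt->len <= dsk->sndbuf) {
		/* step 2: ring full, queue on the socket up to sndbuf bytes */
		list_add_tail(&pkt->node, &dsk->pending);
		dsk->queued_bytes += pkt->len;
		if (list_empty(&dsk->wait_node))
			list_add_tail(&dsk->wait_node, &txq->waiting);
	} else {
		ret = -ENOBUFS;		/* sndbuf exhausted, caller backs off */
	}
	spin_unlock(&txq->lock);
	return ret;
}

/* Step 3, called when the device marks tx buffers as used. FIFO over
 * the waiting sockets keeps competing senders roughly fair. */
static void dgram_tx_space_freed(struct dgram_txq *txq)
{
	spin_lock(&txq->lock);
	while (!list_empty(&txq->waiting)) {
		struct dgram_sock *dsk = list_first_entry(&txq->waiting,
					struct dgram_sock, wait_node);

		/* flush_one_pkt() moves one pending packet into the ring
		 * and uncharges its bytes; it fails if the ring is full. */
		if (!flush_one_pkt(txq, dsk))
			break;
		if (list_empty(&dsk->pending))
			list_del_init(&dsk->wait_node);
		else
			list_move_tail(&dsk->wait_node, &txq->waiting);
	}
	spin_unlock(&txq->lock);
}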

Stefan


* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-26 23:40               ` Jiang Wang .
  2021-03-29  9:25                 ` Stefan Hajnoczi
@ 2021-03-30 15:32                 ` Stefano Garzarella
  2021-03-30 18:34                   ` Jiang Wang .
  1 sibling, 1 reply; 19+ messages in thread
From: Stefano Garzarella @ 2021-03-30 15:32 UTC (permalink / raw)
  To: Jiang Wang .
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Stefan Hajnoczi, asias,
	Arseny Krasnov

Hi Jiang,

On Fri, Mar 26, 2021 at 04:40:09PM -0700, Jiang Wang . wrote:
>Hi Michael and Stefan,
>
>I thought about this and discussed it with my colleague Cong Wang.
>One idea is to make current asynchronous send_pkt flow to be synchronous,
>then if the virtqueue is full, the function can return  ENOMEM all the way back
>to the caller and the caller can check the return value of sendmsg
>and slow down when necessary.
>
>In the spec, we can put something like, if the virtqueue is full, the caller
>should be notified with an error etc.
>
>In terms of implementation, that means we will remove the current
>send_pkt_work for both stream and dgram sockets. Currently, the
>code path uses RCU and a work queue, then grab a mutex in the
>work queue function. Since we cannot grab mutex when in rcu
>critical section, we have to change RCU to a normal reference
>counting mechanism. I think this is doable. The drawback is
>that the reference counting in general spends a little more
>cycles than the RCU, so there is a small price to pay. Another
>option is to use Sleepable RCU and remove the work queue.
>
>What do you guys think?

Another thing that came to mind, not related to the spec but to the Linux
implementation, is the multi-transport support.
When we discussed the initial proposals [1][2], we decided to take a
shortcut for DGRAM, since the only transport with DGRAM support was
vmci. So for now only a single transport with VSOCK_TRANSPORT_F_DGRAM
set can be registered.

Since virtio-transport and vhost-transport will also support DGRAM, we
need to find a way to allow multiple transports that support DGRAM to be
registered together, in order to support nested VMs.

Have you already thought about how to solve this problem?

We should definitely allow the registration of more transports with
VSOCK_TRANSPORT_F_DGRAM, and dynamically choose which one to use when
sending a packet.
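For dgram this could mirror how the core already picks stream transports
by destination CID; a sketch, where the two globals are made-up names for
the registered dgram transports:

static const struct vsock_transport *transport_dgram_g2h;
static const struct vsock_transport *transport_dgram_h2g;

static const struct vsock_transport *
vsock_dgram_select_transport(u32 remote_cid)
{
	/* Destinations at or below VMADDR_CID_HOST go guest-to-host;
	 * everything else is a guest reached from the host side. */
	if (remote_cid <= VMADDR_CID_HOST)
		return transport_dgram_g2h;
	return transport_dgram_h2g;
}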

Thanks,
Stefano

[1] https://www.spinics.net/lists/netdev/msg570274.html
[2] https://www.spinics.net/lists/netdev/msg575792.html


* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-30 10:42                     ` Stefan Hajnoczi
@ 2021-03-30 18:30                       ` Jiang Wang .
  0 siblings, 0 replies; 19+ messages in thread
From: Jiang Wang . @ 2021-03-30 18:30 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Arseny Krasnov, asias

On Tue, Mar 30, 2021 at 3:42 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Mar 29, 2021 at 04:22:28PM -0700, Jiang Wang . wrote:
> > On Mon, Mar 29, 2021 at 2:26 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Fri, Mar 26, 2021 at 04:40:09PM -0700, Jiang Wang . wrote:
> > > > I thought about this and discussed it with my colleague Cong Wang.
> > > > One idea is to make current asynchronous send_pkt flow to be synchronous,
> > > > then if the virtqueue is full, the function can return  ENOMEM all the way back
> > > > to the caller and the caller can check the return value of sendmsg
> > > > and slow down when necessary.
> > > >
> > > > In the spec, we can put something like, if the virtqueue is full, the caller
> > > > should be notified with an error etc.
> > > >
> > > > In terms of implementation, that means we will remove the current
> > > > send_pkt_work for both stream and dgram sockets. Currently, the
> > > > code path uses RCU and a work queue, then grab a mutex in the
> > > > work queue function. Since we cannot grab mutex when in rcu
> > > > critical section, we have to change RCU to a normal reference
> > > > counting mechanism. I think this is doable. The drawback is
> > > > that the reference counting in general spends a little more
> > > > cycles than the RCU, so there is a small price to pay. Another
> > > > option is to use Sleepable RCU and remove the work queue.
> > > >
> > > > What do you guys think?
> > >
> > > I think the tx code path is like this because of reliable delivery.
> > > Maybe a separate datagram rx/tx code path would be simpler?
> >
> > I thought about this too.  dgram can have a separate rx/tx
> > path from stream types. In this case, the the_virtio_vsock
> > will still be shared because the virtqueues have to be in one
> > structure. Then the_virtio_vsock will be protected by a rcu
> > and a reference counting ( or a sleepable RCU ).
> >
> > In vhost_vsock_dev_release, it will wait for both rcu and another
> > one to be finished and then free the memory. I think this is
> > doable. Let me know if there is a better way to do it.
> > Btw, I think dgram tx code path will be quite different from
> > stream, but dgram rx path will be similar to stream type.
> >
> > > Take the datagram tx virtqueue lock, try to add the packet into the
> > > virtqueue, and return -ENOBUFS if the virtqueue is full. Then use the
> > > datagram socket's sndbuf accounting to prevent queuing up too many
> > > packets. When a datagram tx virtqueue buffer is used by the device,
> > > select queued packets for transmission.
> >
> > I am not sure about the last sentence. In the new design, there will
> > be no other queued packets for dgram (except those in the virtqueue).
> > When dgram tx virtqueue is freed by the device, the driver side
> > just needs to free some space. No need
> > to trigger more transmission.
>
> This approach drops packets when the virtqueue is full. In order to get
> close to UDP semantics I think the following is necessary:
>
> 1. Packets are placed directly into the tx virtqueue when possible.
> 2. Packets are queued up to sndbuf bytes if the tx virtqueue is full.
> 3. When a full tx virtqueue gets some free space again there is an
>    algorithm that selects from among the queued sockets.
>
> This is how unreliably delivery can be somewhat fair between competing
> threads. I thought you were trying to achieve this (it's similar to what
> UDP does)?

I see. I was thinking of something different, without an extra queue. I
think your approach is better. For step 3, I think I will just start with
a simple FIFO algorithm.


> Stefan

* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-30 15:32                 ` Stefano Garzarella
@ 2021-03-30 18:34                   ` Jiang Wang .
  2021-03-31  1:02                     ` Jiang Wang .
  0 siblings, 1 reply; 19+ messages in thread
From: Jiang Wang . @ 2021-03-30 18:34 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Stefan Hajnoczi, asias,
	Arseny Krasnov

On Tue, Mar 30, 2021 at 8:32 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> Hi Jiang,
>
> On Fri, Mar 26, 2021 at 04:40:09PM -0700, Jiang Wang . wrote:
> >Hi Michael and Stefan,
> >
> >I thought about this and discussed it with my colleague Cong Wang.
> >One idea is to make current asynchronous send_pkt flow to be synchronous,
> >then if the virtqueue is full, the function can return  ENOMEM all the way back
> >to the caller and the caller can check the return value of sendmsg
> >and slow down when necessary.
> >
> >In the spec, we can put something like, if the virtqueue is full, the caller
> >should be notified with an error etc.
> >
> >In terms of implementation, that means we will remove the current
> >send_pkt_work for both stream and dgram sockets. Currently, the
> >code path uses RCU and a work queue, then grab a mutex in the
> >work queue function. Since we cannot grab mutex when in rcu
> >critical section, we have to change RCU to a normal reference
> >counting mechanism. I think this is doable. The drawback is
> >that the reference counting in general spends a little more
> >cycles than the RCU, so there is a small price to pay. Another
> >option is to use Sleepable RCU and remove the work queue.
> >
> >What do you guys think?
>
> another thing that came to mind not related to the spec but to the Linux
> implementation, is the multi-transport support.
> When we discussed the initial proposals [1][2], we decided to take a
> shortcut for DGRAM, since the only transport with DGRAM support was
> vmci. So for now only a single transport with VSOCK_TRANSPORT_F_DGRAM
> set can be registered.
>
> Since also virtio-transport and vhost-transport will support DGRAM, we
> need to find a way to allow multiple transports that support DGRAM to be
> registered together to support nested VMs.
>
> Do you have already thought about how to solve this problem?
>
> We should definitely allow the registration of more transports with
> VSOCK_TRANSPORT_F_DGRAM, and dynamically choose which one to use when
> sending a packet.

Got it. We briefly discussed multiple dgram transport support in my old
email thread. At that time, I was thinking we could maybe check the
transport for each packet. I did not spend more time on that part after
that. I will think about it more and get back to you. Thanks

> Thanks,
> Stefano
>
> [1] https://www.spinics.net/lists/netdev/msg570274.html
> [2] https://www.spinics.net/lists/netdev/msg575792.html
>

* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-30 18:34                   ` Jiang Wang .
@ 2021-03-31  1:02                     ` Jiang Wang .
  2021-03-31  6:42                       ` Stefano Garzarella
  0 siblings, 1 reply; 19+ messages in thread
From: Jiang Wang . @ 2021-03-31  1:02 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Stefan Hajnoczi, asias,
	Arseny Krasnov

Hi Stefano,

I checked the dgram code again. I think assigning a transport for each
packet is doable. Dgram will be similar to stream in having two transports.

If there is no other problem, I can submit a Linux kernel patch to support
nested dgram on VMCI first. Then it will be easier for virtio vsock.

Also, I don't think we need to put anything related to this multiple-
transport support in the spec. Let me know if otherwise.

Regards,

Jiang

On Tue, Mar 30, 2021 at 11:34 AM Jiang Wang . <jiang.wang@bytedance.com> wrote:
>
> On Tue, Mar 30, 2021 at 8:32 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >
> > Hi Jiang,
> >
> > On Fri, Mar 26, 2021 at 04:40:09PM -0700, Jiang Wang . wrote:
> > >Hi Michael and Stefan,
> > >
> > >I thought about this and discussed it with my colleague Cong Wang.
> > >One idea is to make current asynchronous send_pkt flow to be synchronous,
> > >then if the virtqueue is full, the function can return  ENOMEM all the way back
> > >to the caller and the caller can check the return value of sendmsg
> > >and slow down when necessary.
> > >
> > >In the spec, we can put something like, if the virtqueue is full, the caller
> > >should be notified with an error etc.
> > >
> > >In terms of implementation, that means we will remove the current
> > >send_pkt_work for both stream and dgram sockets. Currently, the
> > >code path uses RCU and a work queue, then grab a mutex in the
> > >work queue function. Since we cannot grab mutex when in rcu
> > >critical section, we have to change RCU to a normal reference
> > >counting mechanism. I think this is doable. The drawback is
> > >that the reference counting in general spends a little more
> > >cycles than the RCU, so there is a small price to pay. Another
> > >option is to use Sleepable RCU and remove the work queue.
> > >
> > >What do you guys think?
> >
> > another thing that came to mind not related to the spec but to the Linux
> > implementation, is the multi-transport support.
> > When we discussed the initial proposals [1][2], we decided to take a
> > shortcut for DGRAM, since the only transport with DGRAM support was
> > vmci. So for now only a single transport with VSOCK_TRANSPORT_F_DGRAM
> > set can be registered.
> >
> > Since also virtio-transport and vhost-transport will support DGRAM, we
> > need to find a way to allow multiple transports that support DGRAM to be
> > registered together to support nested VMs.
> >
> > Do you have already thought about how to solve this problem?
> >
> > We should definitely allow the registration of more transports with
> > VSOCK_TRANSPORT_F_DGRAM, and dynamically choose which one to use when
> > sending a packet.
>
> Got it. We briefly discussed multiple dgram transport
> support in my old email thread. At that time, I was thinking
> maybe check for transport for each packet. I did not spend more
> time on that part after that. I will think about it more and get
> back to you. Thanks
>
> > Thanks,
> > Stefano
> >
> > [1] https://www.spinics.net/lists/netdev/msg570274.html
> > [2] https://www.spinics.net/lists/netdev/msg575792.html
> >

* Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type
  2021-03-31  1:02                     ` Jiang Wang .
@ 2021-03-31  6:42                       ` Stefano Garzarella
  0 siblings, 0 replies; 19+ messages in thread
From: Stefano Garzarella @ 2021-03-31  6:42 UTC (permalink / raw)
  To: Jiang Wang .
  Cc: cong.wang, Xiongchun Duan, Michael S. Tsirkin, cohuck,
	virtualization, xieyongji, Stefan Hajnoczi, asias,
	Arseny Krasnov

On Tue, Mar 30, 2021 at 06:02:33PM -0700, Jiang Wang . wrote:
>Hi Stefano,
>
>I checked dgram code again. I think assigning transport for each packet
>is doable. The dgram will be similar to stream to have two transports.

Yeah, makes sense.

>
>If there is no other problem, I can submit a Linux kernel patch to support
>nested dgram on VMCI first. Then it will be easier for virtio vsock.

Okay, I'll look at the patches when you send them.

>
>Also, I don't think we need to put anything related to this multiple-
>transport support in the spec.  Let me know if otherwise.

Agree, it's related to the implementation.

Thanks,
Stefano


* Re: [RFC PATCH] virtio-vsock: add description for datagram type
       [not found]   ` <CAA68J_bQHzFXnsLpCqZ3waPW1NGz+hnu2OXfAG4XOLemLOX9DQ@mail.gmail.com>
@ 2021-04-26 16:07     ` Stefan Hajnoczi
       [not found]       ` <CAA68J_Z=1uf5rLCpQeH+m9YmsYGsbJgf2VtRJjQrBd_jTdUYuA@mail.gmail.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-04-26 16:07 UTC (permalink / raw)
  To: Cong Wang .
  Cc: 段熊春,
	jiang.wang, mst, cohuck, virtualization, xieyongji, asias,
	arseny.krasnov


On Wed, Apr 14, 2021 at 05:43:52PM -0700, Cong Wang . wrote:
> On Wed, Mar 17, 2021 at 8:45 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > Please stick to UDP semantics as much as possible so that applications
> > can be ported easily and developers aren't surprised by unexpected
> > behavior. UDP packets sent to a destination that has no listen socket
> > result in a Connection Refused error. vsock dgrams should behave in the
> > same way.
> 
> There is no connection refused error for UDP, as it is clearly connectionless.
> What you suggested might be an ICMP unreachable error; however, vsock does
> not have anything equivalent. So I think no error is returned in this
> scenario for vsock datagram.

Yes, exactly. UDP uses ICMP unreachable:

  16:55:40.380292 IP 127.0.0.1.41519 > 127.0.0.1.1234: UDP, length 5
  16:55:40.380308 IP 127.0.0.1 > 127.0.0.1: ICMP 127.0.0.1 udp port 1234 unreachable, length 41

This is mentioned a bit here:
https://tools.ietf.org/html/rfc8085#section-5.2

Here is how Python's socket module produces a ConnectionRefused
exception:

  socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
  sendto(3, "hello", 5, 0, {sa_family=AF_INET, sin_port=htons(1234), sin_addr=inet_addr("127.0.0.1")}, 16) = 5
  getsockname(3, {sa_family=AF_INET, sin_port=htons(41519), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
  getpeername(3, 0x7ffc43c99a20, [16])    = -1 ENOTCONN (Transport endpoint is not connected)

UDP itself may not carry this information but applications do rely on
the ICMP information as shown above.

The Linux network stack implementation is here:
net/ipv4/udp.c:__udp4_lib_err().

Please take a look and decide how vsock dgrams can have similar
semantics.

Stefan


* Re: [RFC PATCH] virtio-vsock: add description for datagram type
       [not found]       ` <CAA68J_Z=1uf5rLCpQeH+m9YmsYGsbJgf2VtRJjQrBd_jTdUYuA@mail.gmail.com>
@ 2021-05-13 16:04         ` Stefan Hajnoczi
  0 siblings, 0 replies; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-05-13 16:04 UTC (permalink / raw)
  To: Cong Wang .
  Cc: 段熊春,
	jiang.wang, mst, cohuck, virtualization, xieyongji, asias,
	arseny.krasnov


On Thu, May 06, 2021 at 04:56:48PM -0700, Cong Wang . wrote:
> On Mon, Apr 26, 2021 at 9:07 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Wed, Apr 14, 2021 at 05:43:52PM -0700, Cong Wang . wrote:
> > > On Wed, Mar 17, 2021 at 8:45 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > Please stick to UDP semantics as much as possible so that applications
> > > > can be ported easily and developers aren't surprised by unexpected
> > > > behavior. UDP packets sent to a destination that has no listen socket
> > > > result in a Connection Refused error. vsock dgrams should behave in the
> > > > same way.
> > >
> > > There is no connection refused error for UDP, as it is clearly connectionless.
> > > What you suggested might be an ICMP unreachable error, however, vsock does
> > > not have anything equivalent. So, I think there is no error returned
> > > in this scenario
> > > for vsock datagram.
> >
> > Yes, exactly. UDP uses ICMP unreachable:
> >
> >   16:55:40.380292 IP 127.0.0.1.41519 > 127.0.0.1.1234: UDP, length 5
> >   16:55:40.380308 IP 127.0.0.1 > 127.0.0.1: ICMP 127.0.0.1 udp port 1234 unreachable, length 41
> >
> > This is mentioned a bit here:
> > https://tools.ietf.org/html/rfc8085#section-5.2
> >
> > Here is how Python's socket module produces an ConnectionRefused
> > exception:
> >
> >   socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 3
> >   sendto(3, "hello", 5, 0, {sa_family=AF_INET, sin_port=htons(1234), sin_addr=inet_addr("127.0.0.1")}, 16) = 5
> >   getsockname(3, {sa_family=AF_INET, sin_port=htons(41519), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
> >   getpeername(3, 0x7ffc43c99a20, [16])    = -1 ENOTCONN (Transport endpoint is not connected)
> 
> Yes, we are all aware of these UDP behaviors.
> 
> >
> > UDP itself may not carry this information but applications do rely on
> > the ICMP information as shown above.
> >
> > The Linux network stack implementation is here:
> > net/ipv4/udp.c:__udp4_lib_err().
> 
> Sure, but when porting applications from UDP to vsock dgram, developers
> must be aware that AF_VSOCK has totally different protocols from AF_INET,
> and hence behaves differently. Therefore, it is not necessary to do
> everything UDP does here; in particular it is not necessary to mimic
> ICMP in vsock. In short, I don't think there is any standard that
> enforces how dgram sockets must behave; each address family should have
> the right to behave reasonably differently.

They can behave differently but at the cost of making it harder to port
applications. Sending a RST packet in vsock is easy and allows
applications to get the same behavior as with UDP/ICMP. I don't see a
reason to avoid it.
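For illustration, a sketch of the RST idea; the helper names are made up.
The receive side answers a dgram to an unbound port with
VIRTIO_VSOCK_OP_RST, and the send side surfaces an incoming RST to the
originating socket, roughly matching UDP's ICMP port-unreachable behavior:

/* Receive side: no socket bound to the destination port, reply with RST. */
static void dgram_recv_pkt(struct virtio_vsock_pkt *pkt)
{
	struct dgram_sock *dsk = dgram_lookup(le64_to_cpu(pkt->hdr.dst_cid),
					      le32_to_cpu(pkt->hdr.dst_port));

	if (!dsk) {
		dgram_send_reset(&pkt->hdr);	/* RST with src/dst swapped */
		virtio_transport_free_pkt(pkt);
		return;
	}
	dgram_sock_enqueue(dsk, pkt);
}

/* Send side: a RST makes the next sendmsg()/recvmsg() on the socket
 * that sent the original dgram fail with ECONNREFUSED. */
static void dgram_handle_reset(struct virtio_vsock_pkt *pkt)
{
	struct dgram_sock *dsk = dgram_lookup(le64_to_cpu(pkt->hdr.dst_cid),
					      le32_to_cpu(pkt->hdr.dst_port));

	if (dsk)
		dgram_sock_set_err(dsk, ECONNREFUSED);
}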

Stefan


Thread overview: 19+ messages
2021-03-16 21:56 [RFC PATCH] virtio-vsock: add description for datagram type jiang.wang
2021-03-17 15:44 ` Stefan Hajnoczi
2021-03-18 17:59   ` [External] " Jiang Wang .
2021-03-22 16:50     ` Stefan Hajnoczi
2021-03-22 23:02       ` Jiang Wang .
2021-03-22 23:10         ` Michael S. Tsirkin
2021-03-23  2:23           ` Jiang Wang .
2021-03-23  8:53             ` Stefan Hajnoczi
2021-03-26 23:40               ` Jiang Wang .
2021-03-29  9:25                 ` Stefan Hajnoczi
2021-03-29 23:22                   ` Jiang Wang .
2021-03-30 10:42                     ` Stefan Hajnoczi
2021-03-30 18:30                       ` Jiang Wang .
2021-03-30 15:32                 ` Stefano Garzarella
2021-03-30 18:34                   ` Jiang Wang .
2021-03-31  1:02                     ` Jiang Wang .
2021-03-31  6:42                       ` Stefano Garzarella
     [not found]   ` <CAA68J_bQHzFXnsLpCqZ3waPW1NGz+hnu2OXfAG4XOLemLOX9DQ@mail.gmail.com>
2021-04-26 16:07     ` Stefan Hajnoczi
     [not found]       ` <CAA68J_Z=1uf5rLCpQeH+m9YmsYGsbJgf2VtRJjQrBd_jTdUYuA@mail.gmail.com>
2021-05-13 16:04         ` Stefan Hajnoczi
