From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: virtio-return-2917-cohuck=redhat.com@lists.oasis-open.org Sender: List-Post: List-Help: List-Unsubscribe: List-Subscribe: Date: Mon, 26 Feb 2018 23:05:14 +0200 From: "Michael S. Tsirkin" Message-ID: <20180226224109-mutt-send-email-mst@kernel.org> References: <1518765602-8739-1-git-send-email-mst@redhat.com> <20180216092412-mutt-send-email-mst@kernel.org> <55eb7e9c-97ee-f7d7-10db-190dd844160d@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <55eb7e9c-97ee-f7d7-10db-190dd844160d@linux.vnet.ibm.com> Subject: [virtio] Re: [PATCH v8 08/16] packed virtqueues: more efficient virtqueue layout To: Halil Pasic Cc: virtio@lists.oasis-open.org, virtio-dev@lists.oasis-open.org, Cornelia Huck , Tiwei Bie , Stefan Hajnoczi , "Dhanoa, Kully" List-ID: On Mon, Feb 26, 2018 at 06:19:21PM +0100, Halil Pasic wrote: > Some of my comments are taken from (unchanged or reworded) > (https://lists.oasis-open.org/archives/virtio-dev/201802/msg00026.html) > > Tried to not repeat stuff pointed out by Tiwei Bie. > > > On 02/16/2018 08:24 AM, Michael S. Tsirkin wrote: > > Performance analysis of this is in my kvm forum 2016 presentation. The > > idea is to have a r/w descriptor in a ring structure, replacing the used > > and available ring, index and descriptor buffer. > > > > This is also easier for devices to implement than the 1.0 layout. > > Several more enhancements will be necessary to actually make this > > efficient for devices to use. > > > > Signed-off-by: Michael S. Tsirkin > > --- > > content.tex | 28 ++- > > packed-ring.tex | 646 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 671 insertions(+), 3 deletions(-) > > create mode 100644 packed-ring.tex > > > > diff --git a/content.tex b/content.tex > > index e1e30a0..73f40b7 100644 > > --- a/content.tex > > +++ b/content.tex > > @@ -263,8 +263,20 @@ these parts (following \ref{sec:Basic Facilities of a Virtio Device / Split Virt > > > > \end{note} > > > > +Two formats are supported: Split Virtqueues (see \ref{sec:Basic > > +Facilities of a Virtio Device / Split > > +Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / > > +Split Virtqueues}) and Packed Virtqueues (see \ref{sec:Basic > > +Facilities of a Virtio Device / Packed > > +Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / > > +Packed Virtqueues}). > > + > > +Every driver and device supports either the Packed or the Split > > +Virtqueue format, or both. > > + > > \input{split-ring.tex} > > > > +\input{packed-ring.tex} > > \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation} > > > > We start with an overview of device initialization, then expand on the > > @@ -5215,10 +5227,15 @@ Currently these device-independent feature bits defined: > > \begin{description} > > \item[VIRTIO_F_RING_INDIRECT_DESC (28)] Negotiating this feature indicates > > that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT > > - flag set, as described in \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors}. > > - > > + flag set, as described in \ref{sec:Basic Facilities of a Virtio > > +Device / Virtqueues / The Virtqueue Descriptor Table / Indirect > > +Descriptors}~\nameref{sec:Basic Facilities of a Virtio Device / > > +Virtqueues / The Virtqueue Descriptor Table / Indirect > > +Descriptors} and \ref{sec:Packed Virtqueues / Indirect Flag: Scatter-Gather Support}~\nameref{sec:Packed Virtqueues / Indirect Flag: Scatter-Gather Support}. > > \item[VIRTIO_F_RING_EVENT_IDX(29)] This feature enables the \field{used_event} > > - and the \field{avail_event} fields as described in \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression} and \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}. > > + and the \field{avail_event} fields as described in > > +\ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}, \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring} and \ref{sec:Packed Virtqueues / Driver and Device Event Suppression}. > > + > > > > \item[VIRTIO_F_VERSION_1(32)] This indicates compliance with this > > specification, giving a simple way to detect legacy devices or drivers. > > @@ -5228,6 +5245,9 @@ Currently these device-independent feature bits defined: > > addresses in memory. If this feature bit is set to 0, then the device emits > > physical addresses which are not translated further, even though an IOMMU > > may be present. > > + \item[VIRTIO_F_RING_PACKED(34)] This feature indicates > > + support for the packed virtqueue layout as described in > > + \ref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Packed Virtqueues}. > > \end{description} > > > > \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits} > > @@ -5241,6 +5261,8 @@ passed to the device into physical addresses in memory. If > > VIRTIO_F_IOMMU_PLATFORM is not offered, then a driver MUST pass only physical > > addresses to the device. > > > > +A driver SHOULD accept VIRTIO_F_RING_PACKED if it is offered. > > + > > \devicenormative{\section}{Reserved Feature Bits}{Reserved Feature Bits} > > > > A device MUST offer VIRTIO_F_VERSION_1. A device MAY fail to operate further > > diff --git a/packed-ring.tex b/packed-ring.tex > > new file mode 100644 > > index 0000000..213295b > > --- /dev/null > > +++ b/packed-ring.tex > > @@ -0,0 +1,646 @@ > > +\section{Packed Virtqueues}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues} > > + > > +Packed virtqueues is an alternative compact virtqueue layout using > > +read-write memory, that is memory that is both read and written > > +by both host and guest. > > + > > +Use of packed virtqueues is negotiated by the VIRTIO_F_RING_PACKED > > +feature bit. > > + > > +Packed virtqueues support up to $2^{15}$ entries each. > > + > > +With current transports, virtqueues are located in guest memory > > +allocated by driver. > > +Each packed virtqueue consists of three parts: > > + > > +\begin{itemize} > > +\item Descriptor Ring - occupies the Descriptor Area > > +\item Driver Event Suppression - occupies the Driver Area > > +\item Device Event Suppression - occupies the Device Area > > +\end{itemize} > > + > > +Where Descriptor Ring in turn consists of descriptors, > > +and where each descriptor can contain the following parts: > > + > > +\begin{itemize} > > +\item Buffer ID > > +\item Element Address > > +\item Element Length > > +\item Flags > > +\end{itemize} > > + > > +A buffer consists of zero or more device-readable physically-contiguous > > +elements followed by zero or more physically-contiguous > > +device-writable elements (each buffer has at least one element). > > + > > +When the driver wants to send such a buffer to the device, it > > +writes at least one available descriptor describing elements of > > +the buffer into the Descriptor Ring. The descriptor(s) are > > +associated with a buffer by means of a Buffer ID stored within > > +the descriptor. > > + > > +Driver then notifies the device. When the device has finished > > +processing the buffer, it writes a used device descriptor > > +including the Buffer ID into the Descriptor Ring (overwriting a > > +driver descriptor previously made available), and sends an > > +interrupt. > > + > > +Descriptor Ring is used in a circular manner: driver writes > > +descriptors into the ring in order. After reaching end of ring, > > +the next descriptor is placed at head of the ring. Once ring is > > +full of driver descriptors, driver stops sending new requests and > > +waits for device to start processing descriptors and to write out > > +some used descriptors before making new driver descriptors > > +available. > > + > > +Similarly, device reads descriptors from the ring in order and > > +detects that a driver descriptor has been made available. As > > +processing of descriptors is completed used descriptors are > > +written by the device back into the ring. > > + > > +Note: after reading driver descriptors and starting their > > +processing in order, device might complete their processing out > > +of order. Used device descriptors are written in the order > > +in which their processing is complete. > > + > > +Device Event Suppression data structure is write-only by the > > +device. It includes information for reducing the number of > > +device events - i.e. driver notifications to device. > > + > > +Driver Event Suppression data structure is read-only by the > > +device. It includes information for reducing the number of > > +driver events - i.e. device interrupts to driver. > > + > > +\subsection{Driver and Device Ring Wrap Counters} > > +\label{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters} > > +Each of the driver and the device are expected to maintain, > > +internally, a single-bit ring wrap counter initialized to 1. > > + > > +The counter maintained by the driver is called the Driver > > +Ring Wrap Counter. Driver changes the value of this counter > > +each time it makes available the > > +last descriptor in the ring (after making the last descriptor > > +available). > > + > > +The counter maintained by the device is called the Device Ring Wrap > > +Counter. Device changes the value of this counter > > +each time it uses the last descriptor in > > +the ring (after marking the last descriptor used). > > + > > +It is easy to see that the Driver Ring Wrap Counter in the driver matches > > +the Device Ring Wrap Counter in the device when both are processing the same > > +descriptor, or when all available descriptors have been used. > > + > > +To mark a descriptor as available and used, both driver and > > +device use the following two flags: > > +\begin{lstlisting} > > +#define VIRTQ_DESC_F_AVAIL (1 << 7) > > +#define VIRTQ_DESC_F_USED (1 << 15) > > +\end{lstlisting} > > + > > +To mark a descriptor as available, driver sets the > > +VIRTQ_DESC_F_AVAIL bit in Flags to match the internal Driver > > +Ring Wrap Counter. It also sets the VIRTQ_DESC_F_USED bit to match the > > +\emph{inverse} value (i.e. to not match the internal Driver Ring > > +Wrap Counter). > > + > > +To mark a descriptor as used, device sets the > > +VIRTQ_DESC_F_USED bit in Flags to match the internal Device > > +Ring Wrap Counter. It also sets the VIRTQ_DESC_F_AVAIL bit to match the > > +\emph{same} value. > > + > > +Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different > > +for an available descriptor and equal for a used descriptor. > > + > > AFAIU VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED being > different is a necessary but not a sufficient pre-condition for > a descriptor being available; VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED > equal is a necessary but not a sufficient pre-condition for > a descriptor being used. Did I get that right? > > If I got it right, then my suggestion is to provide a necessary and > sufficient condition here. > > > > +\subsection{Polling of available and used descriptors} > > +\label{sec:Packed Virtqueues / Polling of available and used descriptors} > > + > > +Writes of device and driver descriptors can generally be > > +reordered, but each side (driver and device) are only required to > > +poll (or test) a single location in memory: next device descriptor after > > +the one they processed previously, in circular order. > > + > > +Sometimes device needs to only write out a single used descriptor > > +after processing a batch of multiple available descriptors. As > > +described in more detail below, this can happen when using > > +descriptor chaining or with in-order > > +use of descriptors. In this case, device writes out a used > > +descriptor with buffer id of the last descriptor in the group. > > +After processing the used descriptor, both device and driver then > > +skip forward in the ring the number of the remaining descriptors > > +in the group until processing (reading for the driver and writing > > +for the device) the next used descriptor. > > + > > +\subsection{Write Flag} > > +\label{sec:Packed Virtqueues / Write Flag} > > + > > +In an available descriptor, VIRTQ_DESC_F_WRITE bit within Flags > > +is used to mark a descriptor as corresponding to a write-only or > > +read-only element of a buffer. > > + > > +\begin{lstlisting} > > +/* This marks a descriptor as device write-only (otherwise device read-only). */ > > +#define VIRTQ_DESC_F_WRITE 2 > > +\end{lstlisting} > > + > > +In a used descriptor, this bit is used to specify whether any > > +data has been written by the device into any parts of the buffer. > > + > > + > > +\subsection{Element Address and Length} > > +\label{sec:Packed Virtqueues / Element Address and Length} > > + > > +In an available descriptor, Element Address corresponds to the > > +physical address of the buffer element. The length of the element assumed > > +to be physically contigious is stored in Element Length. > > + > > +In a used descriptor, Element Address is unused. Element Length > > +specifies the length of the buffer that has been initialized > > +(written to) by the device. > > + > > +Element length is reserved for used descriptors without the > > +VIRTQ_DESC_F_WRITE flag, and is ignored by drivers. > > + > > +\subsection{Scatter-Gather Support} > > [Consistent wording] Both types of virtqueues support scatter-gather > but the term is used only for packed. Maybe we could unify the wording. > Or is this something for later? I'll take a look but this can be safely done later too. > > +\label{sec:Packed Virtqueues / Scatter-Gather Support} > > + > > +Some drivers need an ability to supply a list of multiple buffer > > +elements (also known as a scatter/gather list) with a request. > > +Two optional features support this: descriptor > > +chaining and indirect descriptors. > > + > > +If neither feature has been negotiated, each buffer is > > +physically-contigious, either read-only or write-only and is > > +described completely by a single descriptor. > > + > > +While unusual (most implementations either create all lists > > +solely using non-indirect descriptors, or always use a single > > +indirect element), if both features have been negotiated, mixing > > +direct and direct descriptors in a ring is valid, as long as each > > +list only contains descriptors of a given type. > > + > > +Scatter/gather lists only apply to available descriptors. A > > +single used descriptor corresponds to the whole list. > > + > > +The device limits the number of descriptors in a list through a > > +transport-specific and/or device-specific value. If not limited, > > +the maximum number of descriptors in a list is the virt queue > > +size. > > + > > +\subsection{Next Flag: Descriptor Chaining} > > +\label{sec:Packed Virtqueues / Next Flag: Descriptor Chaining} > > + > > +The packed ring format allows driver to supply > > +a scatter/gather list to the device > > +by using multiple descriptors, and setting the VIRTQ_DESC_F_NEXT in > > +Flags for all but the last available descriptor. > > + > > +\begin{lstlisting} > > +/* This marks a buffer as continuing. */ > > +#define VIRTQ_DESC_F_NEXT 1 > > +\end{lstlisting} > > + > > +Buffer ID is included in the last descriptor in the list. > > + > > +The driver always makes the first descriptor in the list > > +available after the rest of the list has been written out into > > +the ring. This guarantees that the device will never observe a > > +partial scatter/gather list in the ring. > > + > > +Note: all flags, including VIRTQ_DESC_F_AVAIL, VIRTQ_DESC_F_USED, > > +VIRTQ_DESC_F_WRITE must be set/cleared correctly in all > > +descriptors in the list, not just the first one. > > + > > +Device only writes out a single used descriptor for the whole > > +list. It then skips forward according to the number of > > +descriptors in the list. Driver needs to keep track of the size > > +of the list corresponding to each buffer ID, to be able to skip > > +to where the next used descriptor is written by the device. > > + > > +For example, if descriptors are used in the same order in which > > +they are made available, this will result in the used descriptor > > +overwriting the first available descriptor in the list, the used > > +descriptor for the next list overwriting the first available > > +descriptor in the next list, etc. > > + > > +VIRTQ_DESC_F_NEXT is reserved in used descriptors, and > > +should be ignored by drivers. > > + > > +\subsection{Indirect Flag: Scatter-Gather Support} > > +\label{sec:Packed Virtqueues / Indirect Flag: Scatter-Gather Support} > > + > > +Some devices benefit by concurrently dispatching a large number > > +of large requests. The VIRTIO_F_INDIRECT_DESC feature allows this. To increase > > +ring capacity the driver can store a (read-only by the device) table of indirect > > +descriptors anywhere in memory, and insert a descriptor in main > > +virtqueue (with \field{Flags} bit VIRTQ_DESC_F_INDIRECT on) that refers to > > +a buffer element > > +containing this indirect descriptor table; \field{addr} and \field{len} > > +refer to the indirect table address and length in bytes, > > +respectively. > > +\begin{lstlisting} > > +/* This means the element contains a table of descriptors. */ > > +#define VIRTQ_DESC_F_INDIRECT 4 > > +\end{lstlisting} > > + > > +The indirect table layout structure looks like this > > +(\field{len} is the Buffer Length of the descriptor that refers to this table, > > +which is a variable): > > + > > +\begin{lstlisting} > > +struct indirect_descriptor_table { > > + /* The actual descriptor structures (struct virtq_desc each) */ > > + struct virtq_desc desc[len / sizeof(struct virtq_desc)]; > > +}; > > +\end{lstlisting} > > + > > +The first descriptor is located at start of the indirect > > +descriptor table, additional indirect descriptors come > > +immediately afterwards. \field{Flags} bit VIRTQ_DESC_F_WRITE is the > > +only valid flag for descriptors in the indirect table. Others > > +are reserved and are ignored by the device. > > +Buffer ID is also reserved and is ignored by the device. > > + > > +In Descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE > > +is reserved and is ignored by the device. > > + > > +\subsection{Multi-buffer requests} > > +\label{sec:Packed Virtqueues / Multi-descriptor batches} > > +Some devices combine multiple buffers as part of processing of a > > +single request. These devices always mark the first descriptor > > +in the request used after the rest of the descriptors in the > > +request has been written out into the ring. This guarantees that > > +the driver will never observe a partial request in the ring. > > + > > I see you've changed s/in the request available/in the request used/. > But I still don't understand this paragraph. I will try to figure > it out later (and will come back to you if I fail). FYI this applies to mergeable buffers for the network device. > > + > > +\subsection{Driver and Device Event Suppression} > > +\label{sec:Packed Virtqueues / Driver and Device Event Suppression} > > +In many systems driver and device notifications involve > > +significant overhead. To mitigate this overhead, > > +each virtqueue includes two identical structures used for > > +controlling notifications between device and driver. > > + > > +Driver Event Suppression structure is read-only by the > > +device and controls the events sent by the device > > +to the driver (e.g. interrupts). > > + > > +Device Event Suppression structure is read-only by > > +the driver and controls the events sent by the driver > > +to the device (e.g. IO). > > + > > +Each of these Event Suppression structures controls > > +both Descriptor Ring events and structure events, and > > +each includes the following fields: > > + > > +\begin{description} > > +\item [Descriptor Ring Change Event Flags] Takes values: > > +\begin{itemize} > > +\item 00b enable events > > +\item 01b disable events > > +\item 10b enable events for a specific descriptor > > +(as specified by Descriptor Ring Change Event Offset/Wrap Counter). > > +Only valid if VIRTIO_F_RING_EVENT_IDX has been negotiated. > > +\item 11b reserved > > +\end{itemize} > > +\item [Descriptor Ring Change Event Offset] If Event Flags set to descriptor > > +specific event: offset within the ring (in units of descriptor > > +size). Event will only trigger when this descriptor is > > +made available/used respectively. > > +\item [Descriptor Ring Change Event Wrap Counter] If Event Flags set to descriptor > > +specific event: offset within the ring (in units of descriptor > > +size). Event will only trigger when Ring Wrap Counter > > +matches this value and a descriptor is > > +made available/used respectively. > > +\end{description} > > + > > +After writing out some descriptors, both device and driver > > +are expected to consult the relevant structure to find out > > +whether interrupt/notification should be sent. > > + > > +\subsubsection{Structure Size and Alignment} > > +\label{sec:Packed Virtqueues / Structure Size and Alignment} > > + > > +Each part of the virtqueue is physically-contiguous in guest memory, > > +and has different alignment requirements. > > + > > +The memory aligment and size requirements, in bytes, of each part of the > > s/aligment/alignment/ > > > +virtqueue are summarized in the following table: > > + > > +\begin{tabular}{|l|l|l|} > > +\hline > > +Virtqueue Part & Alignment & Size \\ > > +\hline \hline > > +Descriptor Ring & 16 & $16 * $(Queue Size) \\ > > +\hline > > +Device Event Suppression & 4 & 4 \\ > > + \hline > > +Driver Event Suppression & 4 & 4 \\ > > + \hline > > +\end{tabular} > > + > > +The Alignment column gives the minimum alignment for each part > > +of the virtqueue. > > + > > +The Size column gives the total number of bytes for each > > +part of the virtqueue. > > + > > +Queue Size corresponds to the maximum number of descriptors in the > > +virtqueue\footnote{For example, if Queue Size is 4 then at most 4 buffers > > +can be queued at any given time.}. Queue Size value does not > > +have to be a power of 2 unless enforced by the transport. > > + > > +\drivernormative{\subsection}{Virtqueues}{Basic Facilities of a > > +Virtio Device / Packed Virtqueues} > > +The driver MUST ensure that the physical address of the first byte > > +of each virtqueue part is a multiple of the specified alignment value > > +in the above table. > > + > > +\devicenormative{\subsection}{Virtqueues}{Basic Facilities of a > > +Virtio Device / Packed Virtqueues} > > +The device MUST start processing driver descriptors in the order > > +in which they appear in the ring. > > +The device MUST start writing device descriptors into the ring in > > +the order in which they complete. > > +Device MAY reorder descriptor writes once they are started. > > + > > +\subsection{The Virtqueue Descriptor Format}\label{sec:Basic > > +Facilities of a Virtio Device / Packed Virtqueues / The Virtqueue > > +Descriptor Format} > > + > > +The available descriptor refers to the buffers the driver is sending > > +to the device. \field{addr} is a physical address, and the > > +descriptor is identified with a buffer using the \field{id} field. > > + > > +\begin{lstlisting} > > +struct virtq_desc { > > + /* Buffer Address. */ > > + le64 addr; > > + /* Buffer Length. */ > > + le32 len; > > + /* Buffer ID. */ > > + le16 id; > > + /* The flags depending on descriptor type. */ > > + le16 flags; > > +}; > > +\end{lstlisting} > > + > > +The descriptor ring is zero-initialized. > > + > > +\subsection{Event Suppression Structure Format}\label{sec:Basic > > +Facilities of a Virtio Device / Packed Virtqueues / Event Suppression Structure > > +Format} > > + > > +The following structure is used to reduce the number of > > +notifications sent between driver and device. > > + > > +\begin{lstlisting} > > +__le16 desc_event_off : 15; /* Descriptor Event Offset */ > > +int desc_event_wrap : 1; /* Descriptor Event Wrap Counter */ > > +__le16 desc_event_flags : 2; /* Descriptor Event Flags */ > > +\end{lstlisting} > > + > > +\devicenormative{\subsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / Packed Virtqueues / The Virtqueue Descriptor Table} > > +A device MUST NOT write to a device-readable buffer, and a device SHOULD NOT > > +read a device-writable buffer. > > +A device MUST NOT use a descriptor unless it observes > > +VIRTQ_DESC_F_AVAIL bit in its \field{flags} being changed. > > I don't really understand this. How does the device observe > the VIRTQ_DESC_F_AVAIL bit being changed? By reading the descriptor. > > +A device MUST NOT change a descriptor after changing it's > > +VIRTQ_DESC_F_USED bit in its \field{flags}. > > + > > +\drivernormative{\subsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / PAcked Virtqueues / The Virtqueue Descriptor Table} > > +A driver MUST NOT change a descriptor unless it observes > > +VIRTQ_DESC_F_USED bit in its \field{flags} being changed. > > +A driver MUST NOT change a descriptor after changing > > +VIRTQ_DESC_F_USED bit in its \field{flags}. > > I'm a bit confused with this protocol. AFAIU the driver > always writes the value into the VIRTQ_DESC_F_USED that > is already there (so it does not change). Was that supposed > to be VIRTQ_DESC_F_AVAIL. That was supposed to be VIRTQ_DESC_F_AVAIL. > > +When notifying the device, driver MUST set > > +\field{next_off} and > > +\field{next_wrap} to match the next descriptor > > +not yet made available to the device. > > +A driver MAY send multiple notifications without making > > +any new descriptors available to the device. > > + > > +\drivernormative{\subsection}{Scatter-Gather Support}{Basic Facilities of a > > +Virtio Device / Packed Virtqueues / Scatter-Gather Support} > > +A driver MUST NOT create a descriptor list longer than allowed > > +by the device. > > + > > +A driver MUST NOT create a descriptor list longer than the Queue > > +Size. > > + > > +This implies that loops in the descriptor list are forbidden! > > + > > +The driver MUST place any device-writable descriptor elements after > > +any device-readable descriptor elements. > > + > > +A driver MUST NOT depend on the device to use more descriptors > > +to be able to write out all descriptors in a list. A driver > > +MUST make sure there's enough space in the ring > > +for the whole list before making the first descriptor in the list > > +available to the device. > > + > > +A driver MUST NOT make the first descriptor in the list > > +available before initializing the rest of the descriptors. > > Initializing is a bit vague here. How about: unless all subsequent descriptors > comprising the list (that is the request) are made available. OK. > > + > > +\devicenormative{\subsection}{Scatter-Gather Support}{Basic Facilities of a > > +Virtio Device / Packed Virtqueues / Scatter-Gather Support} > > +The device MUST use descriptors in a list chained by the > > +VIRTQ_DESC_F_NEXT flag in the same order that they > > +were made available by the driver. > > + > > +The device MAY limit the number of buffers it will allow in a > > +list. > > + > > +\drivernormative{\subsection}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Packed Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors} > > +The driver MUST NOT set the DESC_F_INDIRECT flag unless the > > +VIRTIO_F_INDIRECT_DESC feature was negotiated. The driver MUST NOT > > +set any flags except DESC_F_WRITE within an indirect descriptor. > > + > > +A driver MUST NOT create a descriptor chain longer than allowed > > +by the device. > > + > > +A driver MUST NOT write direct descriptors with > > +DESC_F_INDIRECT set in a scatter-gather list linked by > > +VIRTQ_DESC_F_NEXT. > > +\field{flags}. > > + > > +\subsection{Virtqueue Operation}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Virtqueue Operation} > > + > > +There are two parts to virtqueue operation: supplying new > > +available buffers to the device, and processing used buffers from > > +the device. > > + > > +What follows is the requirements of each of these two parts > > +when using the packed virtqueue format in more detail. > > + > > +\subsection{Supplying Buffers to The Device}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device} > > + > > +The driver offers buffers to one of the device's virtqueues as follows: > > + > > +\begin{enumerate} > > +\item The driver places the buffer into free descriptor in the Descriptor Ring. > > + > > +\item The driver performs a suitable memory barrier to ensure that it updates > > + the descriptor(s) before checking for notification suppression. > > + > > +\item If notifications are not suppressed, the driver notifies the device > > + of the new available buffers. > > +\end{enumerate} > > + > > +What follows is the requirements of each stage in more detail. > > + > > +\subsubsection{Placing Available Buffers Into The Descriptor Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Placing Available Buffers Into The Descriptor Ring} > > + > > +For each buffer element, b: > > + > > +\begin{enumerate} > > +\item Get the next descriptor table entry, d > > +\item Get the next free buffer id value > > +\item Set \field{d.addr} to the physical address of the start of b > > +\item Set \field{d.len} to the length of b. > > +\item Set \field{d.id} to the buffer id > > +\item Calculate the flags as follows: > > +\begin{enumerate} > > +\item If b is device-writable, set the VIRTQ_DESC_F_WRITE bit to 1, otherwise 0 > > +\item Set VIRTQ_DESC_F_AVAIL bit to the current value of the Driver Ring Wrap Counter > > +\item Set VIRTQ_DESC_F_USED bit to inverse value > > +\end{enumerate} > > +\item Perform a memory barrier to ensure that the descriptor has > > + been initialized > > +\item Set \field{d.flags} to the calculated flags value > > +\item If d is the last descriptor in the ring, toggle the > > + Driver Ring Wrap Counter > > +\item Otherwise, increment d to point at the next descriptor > > +\end{enumerate} > > + > > +This makes a single descriptor buffer available. However, in > > +general the driver MAY make use of a batch of descriptors as part > > +of a single request. In that case, it defers updating > > +the descriptor flags for the first descriptor > > +(and the previous memory barrier) until after the rest of > > +the descriptors have been initialized. > > + > > +Once the descriptor \field{flags} is updated by the driver, this exposes the > > +descriptor and its contents. The device MAY > > +access the descriptor and any following descriptors the driver created and the > > +memory they refer to immediately. > > + > > +\drivernormative{\paragraph}{Updating flags}{Basic Facilities of > > +a Virtio Device / Packed Virtqueues / Supplying Buffers to The > > +Device / Updating flags} > > +The driver MUST perform a suitable memory barrier before the > > +\field{flags} update, to ensure the > > +device sees the most up-to-date copy. > > + > > +\subsubsection{Notifying The Device}\label{sec:Basic Facilities > > +of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Notifying The Device} > > + > > +The actual method of device notification is bus-specific, but generally > > +it can be expensive. So the device MAY suppress such notifications if it > > +doesn't need them, using the Driver Event Suppression structure > > +as detailed in section \ref{sec:Basic > > +Facilities of a Virtio Device / Packed Virtqueues / Event > > +Suppression Structure Format}. > > + > > +The driver has to be careful to expose the new \field{flags} > > +value before checking if notifications are suppressed. > > + > > +\subsubsection{Implementation Example}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Implementation Example} > > + > > +Below is an example driver code. It does not attempt to reduce > > +the number of device interrupts, neither does it support > > +the VIRTIO_F_RING_EVENT_IDX feature. > > + > > +\begin{lstlisting} > > + > > +first = vq->next_avail; > > +id = alloc_id(vq); > > + > > +for (each buffer element b) { > > + vq->desc[vq->next_avail].address = get_addr(b); > > + vq->desc[vq->next_avail].len = get_len(b); > > + init_desc(vq->next_avail, b); > > What is init_desc? Can't find it elsewhere and I can't > guess it. Leftover I think - initially the above two lines were part of that function. > > + avail = vq->avail_wrap_count; > > + used = !vq->avail_wrap_count; > > + f = get_flags(b) | (avail << VIRTQ_DESC_F_AVAIL) | (used << VIRTQ_DESC_F_USED); > > + /* Don't mark the 1st descriptor available until all of them are ready. */ > > + if (vq->next_avail == first) { > > + flags = f; > > + } else { > > + vq->desc[vq->next_avail].flags = f; > > + } > > + > > + vq->next_avail++; > > + > > + if (vq->next_avail >= vq->size) { > > + vq->next_avail = 0; > > + vq->avail_wrap_count \^= 1; > > + } > > + > > + > > +} > > +/* ID included in the last descriptor in the list */ > > +vq->desc[vq->next_avail].id = id; > > +write_memory_barrier(); > > +vq->desc[first].flags = flags; > > + > > +memory_barrier(); > > + > > +if (vq->device_event.flags != 0x2) { > > + notify_device(vq); > > +} > > + > > +\end{lstlisting} > > + > > + > > +\drivernormative{\paragraph}{Notifying The Device}{Basic Facilities of a Virtio Device / Packed Virtqueues / Supplying Buffers to The Device / Notifying The Device} > > +The driver MUST perform a suitable memory barrier before reading > > +the Driver Event Suppression structure, to avoid missing a notification. > > + > > +\subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Receiving Used Buffers From The Device} > > + > > +Once the device has used buffers referred to by a descriptor (read from or written to them, or > > +parts of both, depending on the nature of the virtqueue and the > > +device), it interrupts the driver > > +as detailed in section \ref{sec:Basic > > +Facilities of a Virtio Device / Packed Virtqueues / Event > > +Suppression Structure Format}. > > + > > +\begin{note} > > +For optimal performance, a driver MAY disable interrupts while processing > > +the used buffers, but beware the problem of missing interrupts between > > +emptying the ring and reenabling interrupts. This is usually handled by > > +re-checking for more used buffers after interrups are re-enabled: > > +\end{note} > > + > > +\begin{lstlisting} > > +vq->driver_event.flags = 0x2; > > + > > +for (;;) { > > + struct virtq_desc *d = vq->desc[vq->next_used]; > > + > > + flags = d->flags; > > + bool avail = flags & (1 << VIRTQ_DESC_F_AVAIL); > > + bool used = flags & (1 << VIRTQ_DESC_F_USED); > > + > > + if (avail != used) { > > I don't understand the condition which is AFAIU supposed to > correspond to the descriptor *not* being used. So avail == used means used. avail != used means available. > > + vq->driver_event.flags = 0x1; > > + memory_barrier(); > > + > > + flags = d->flags; > > + bool avail = flags & (1 << VIRTQ_DESC_F_AVAIL); > > + bool used = flags & (1 << VIRTQ_DESC_F_USED); > > + if (avail != used) { > > + break; > > + } > > + > > + vq->driver_event.flags = 0x2; > > + } > > + > > + read_memory_barrier(); > > Now with the condition avail != used a freshly (that is zero initialized) > ring would appear all used. And we would do process_buffer(d) for the > whole ring if this code happens to get executed. Do we have to make > sure that this does not happen? I'll have to think about this. > I was under the impression that this whole wrap counter exercise is > to be able to distinguish these cases. > > BTW tools/virtio/ringtest/ring.c has a single flag bit to indicate > available/used and does not have these wrap counters AFAIR. A single flag is fine if there's not s/g support and all descriptors are written out. Wrap counters are needed if we are to support skipping descriptors because of s/g or in order. > Also for split virtqueues a descriptor has three possible states: > * available > * used > * free > > I wonder if it's the same for packed, and if, how do I recognize > free descriptors (that is descriptors that are neither available > nor used. I'll think about this. > I'm pretty much confused on how this scheme with the available > and used wrap counters (or device and driver wrap counters is > supposed to work). A working implementation in C would really help > me to understand this. DPDK based implementation has been posted. > > + process_buffer(d); > > + vq->next_used++; > > + if (vq->next_used >= vq->size) { > > + vq->next_used = 0; > > + } > > +} > > +\end{lstlisting} > > --------------------------------------------------------------------- To unsubscribe from this mail list, you must leave the OASIS TC that generates this mail. Follow this link to all your TCs in OASIS at: https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php