* [virtio-dev] [PATCH v4 0/2] virtio-fs: add virtio file system device
@ 2019-03-05 16:12 Stefan Hajnoczi
  2019-03-05 16:12 ` [virtio-dev] [PATCH v4 1/2] content: " Stefan Hajnoczi
  2019-03-05 16:12 ` [virtio-dev] [PATCH v4 2/2] virtio-fs: add DAX window Stefan Hajnoczi
  0 siblings, 2 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2019-03-05 16:12 UTC
  To: virtio-dev
  Cc: Vivek Goyal, Miklos Szeredi, Dr. David Alan Gilbert,
	Steven Whitehouse, Sage Weil, Paolo Bonzini, Stefan Hajnoczi

v4:
 * Clarify that there are no request ordering guarantees between requests in a
   single queue [vgoyal]
 * Add explanation of FUSE session endianness detection [dgilbert]

v3:
 * Remove notifications virtqueue, it's unimplemented and can be added when
   needed [Miklos]
 * Add Security Considerations and Live Migration Considerations sections
   [Michael]
v2:
 * Clean up core virtio file system device spec
 * Add DAX window

These patches add the virtio file system device, which is based on Linux FUSE
but includes the DAX window extension.  Just as virtio-scsi transports SCSI
commands, virtio-fs transports FUSE requests; the FUSE protocol documentation
is not duplicated here.
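
As a rough illustration of how FUSE requests are routed under this series,
the sketch below picks a virtqueue for a request.  It follows the rules in
patch 1: virtqueue 0 is hiprio and carries FUSE_INTERRUPT, FUSE_FORGET, and
FUSE_BATCH_FORGET; every other request goes to one of the request queues
1..num_queues.  The helper name, the VIRTIO_FS_HIPRIO_QUEUE macro, and the
"hint" parameter are hypothetical and not part of the patches; the opcode
values are the ones defined in include/uapi/linux/fuse.h.

```c
#include <stdint.h>

/* Opcode values as defined in include/uapi/linux/fuse.h. */
enum {
        FUSE_FORGET       = 2,
        FUSE_INTERRUPT    = 36,
        FUSE_BATCH_FORGET = 42,
};

/* Virtqueue 0 is the hiprio queue (name chosen for this sketch). */
#define VIRTIO_FS_HIPRIO_QUEUE 0u

/*
 * Hypothetical helper: return the virtqueue index for a request.
 * num_queues is the request queue count from the config space (>= 1).
 * hint is any value the driver uses to spread load, e.g. a CPU number.
 */
static uint32_t virtio_fs_pick_queue(uint32_t opcode, uint32_t num_queues,
                                     uint32_t hint)
{
        switch (opcode) {
        case FUSE_FORGET:
        case FUSE_INTERRUPT:
        case FUSE_BATCH_FORGET:
                return VIRTIO_FS_HIPRIO_QUEUE;
        default:
                /* Request queues occupy indices 1..num_queues. */
                return 1u + hint % num_queues;
        }
}
```

Because the device may complete requests in any order, a driver using such a
helper would still have to wait for a prerequisite request to complete
before submitting a dependent one.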

The DAX window allows file contents to be accessed directly from shared memory.
This eliminates copying of data, reduces the number of vmexits, and reduces the
guest's memory footprint.  It also allows coherent mmap MAP_SHARED semantics
between guests on the same host.

Stefan Hajnoczi (2):
  content: add virtio file system device
  virtio-fs: add DAX window

 content.tex      |   3 +
 introduction.tex |   3 +
 virtio-fs.tex    | 226 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 232 insertions(+)
 create mode 100644 virtio-fs.tex

-- 
2.20.1


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org



* [virtio-dev] [PATCH v4 1/2] content: add virtio file system device
  2019-03-05 16:12 [virtio-dev] [PATCH v4 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
@ 2019-03-05 16:12 ` Stefan Hajnoczi
  2019-03-05 16:12 ` [virtio-dev] [PATCH v4 2/2] virtio-fs: add DAX window Stefan Hajnoczi
  1 sibling, 0 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2019-03-05 16:12 UTC
  To: virtio-dev
  Cc: Vivek Goyal, Miklos Szeredi, Dr. David Alan Gilbert,
	Steven Whitehouse, Sage Weil, Paolo Bonzini, Stefan Hajnoczi

The virtio file system device transports Linux FUSE requests between a
FUSE daemon running on the host and the FUSE driver inside the guest.

The actual FUSE request definitions are not duplicated in the virtio
specification, similar to how virtio-scsi does not document SCSI
command details.  FUSE request definitions are available here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h

This patch documents the core virtio file system device, which is
functional but lacks the DAX feature introduced in the next patch.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 content.tex      |   3 +
 introduction.tex |   3 +
 virtio-fs.tex    | 201 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 207 insertions(+)
 create mode 100644 virtio-fs.tex

diff --git a/content.tex b/content.tex
index ef10cc0..ea6f5c9 100644
--- a/content.tex
+++ b/content.tex
@@ -2634,6 +2634,8 @@ Device ID  &  Virtio Device    \\
 \hline
 24         &   Memory device \\
 \hline
+26         &   file system device \\
+\hline
 \end{tabular}
 
 Some of the devices above are unspecified by this document,
@@ -5559,6 +5561,7 @@ descriptor for the \field{sense_len}, \field{residual},
 \input{virtio-input.tex}
 \input{virtio-crypto.tex}
 \input{virtio-vsock.tex}
+\input{virtio-fs.tex}
 
 \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 
diff --git a/introduction.tex b/introduction.tex
index c5ccd89..b5b8159 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -60,6 +60,9 @@ Levels'', BCP 14, RFC 2119, March 1997. \newline\url{http://www.ietf.org/rfc/rfc
 	\phantomsection\label{intro:SCSI MMC}\textbf{[SCSI MMC]} &
         SCSI Multimedia Commands,
         \newline\url{http://www.t10.org/cgi-bin/ac.pl?t=f&f=mmc6r00.pdf}\\
+	\phantomsection\label{intro:FUSE}\textbf{[FUSE]} &
+	Linux FUSE interface,
+	\newline\url{https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h}\\
 
 \end{longtable}
 
diff --git a/virtio-fs.tex b/virtio-fs.tex
new file mode 100644
index 0000000..984f14f
--- /dev/null
+++ b/virtio-fs.tex
@@ -0,0 +1,201 @@
+\section{File System Device}\label{sec:Device Types / File System Device}
+
+The virtio file system device provides file system access.  The device may
+directly manage a file system or act as a gateway to a remote file system.  The
+details of how files are accessed are hidden by the device interface, allowing
+for a range of use cases.
+
+Unlike block-level storage devices such as virtio block and SCSI, the virtio
+file system device provides file-level access to data.  The device interface is
+based on the Linux Filesystem in Userspace (FUSE) protocol.  This consists of
+requests for traversing the file system and accessing the files and directories
+within it.  The protocol details are defined by \hyperref[intro:FUSE]{FUSE}.
+
+The device acts as the FUSE file system daemon and the driver acts as the FUSE
+client mounting the file system.  The virtio file system device provides the
+mechanism for transporting FUSE requests, much like /dev/fuse in a traditional
+FUSE application.
+
+This section relies on definitions from \hyperref[intro:FUSE]{FUSE}.
+
+\subsection{Device ID}\label{sec:Device Types / File System Device / Device ID}
+  26
+
+\subsection{Virtqueues}\label{sec:Device Types / File System Device / Virtqueues}
+
+\begin{description}
+\item[0] hiprio
+\item[1\ldots n] request queues
+\end{description}
+
+\subsection{Feature bits}\label{sec:Device Types / File System Device / Feature bits}
+
+There are currently no feature bits defined.
+
+\subsection{Device configuration layout}\label{sec:Device Types / File System Device / Device configuration layout}
+
+All fields of this configuration are always available.
+
+\begin{lstlisting}
+struct virtio_fs_config {
+        char tag[36];
+        le32 num_queues;
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{tag}] is the name associated with this file system.  The tag is
+    encoded in UTF-8 and padded with NUL bytes if shorter than the
+    available space.  This field is not NUL-terminated if the encoded bytes
+    take up the entire field.
+\item[\field{num_queues}] is the total number of request virtqueues exposed by
+    the device. The driver MAY use only one request queue,
+    or it can use more to achieve better performance.
+\end{description}
+
+\drivernormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
+
+The driver MUST NOT write to device configuration fields.
+
+\devicenormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
+
+The device MUST set \field{num_queues} to 1 or greater.
+
+\drivernormative{\subsection}{Device Initialization}{Device Types / File System Device / Device Initialization}
+
+On initialization the driver MUST first discover the
+device's virtqueues.
+
+\subsection{Device Operation}\label{sec:Device Types / File System Device / Device Operation}
+
+Device operation consists of operating the virtqueues to facilitate file system
+access.
+
+The FUSE request types are as follows:
+\begin{itemize}
+\item Normal requests are submitted by the driver and completed by the device.
+\item Interrupt requests are submitted by the driver to abort requests that the
+      device may have yet to complete.
+\end{itemize}
+
+Note that FUSE notification requests are not supported.
+
+\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Request Queues}
+
+The driver enqueues normal requests on an arbitrary request queue and they are
+completed by the device on that same queue. The device processes requests in
+any order.  The driver is responsible for ensuring that ordering constraints
+are met by submitting a dependent request only after its prerequisite request
+has completed.
+
+Requests have the following format:
+
+\begin{lstlisting}
+struct virtio_fs_req {
+        // Device-readable part
+        struct fuse_in_header in;
+        u8 datain[];
+
+        // Device-writable part
+        struct fuse_out_header out;
+        u8 dataout[];
+};
+\end{lstlisting}
+
+Note that the words ``in'' and ``out'' follow the FUSE meaning and do not indicate
+the direction of data transfer under VIRTIO.  ``In'' means input to a request and
+``out'' means output from processing a request.
+
+\field{in} is the common header for all types of FUSE requests.
+
+\field{datain} consists of request-specific data, if any.  This is identical to
+the data read from the /dev/fuse device by a FUSE daemon.
+
+\field{out} is the completion header common to all types of FUSE requests.
+
+\field{dataout} consists of request-specific data, if any.  This is identical
+to the data written to the /dev/fuse device by a FUSE daemon.
+
+For example, the full layout of a FUSE_READ request is as follows:
+
+\begin{lstlisting}
+struct virtio_fs_read_req {
+        // Device-readable part
+        struct fuse_in_header in;
+        union {
+                struct fuse_read_in readin;
+                u8 datain[sizeof(struct fuse_read_in)];
+        };
+
+        // Device-writable part
+        struct fuse_out_header out;
+        u8 dataout[out.len - sizeof(struct fuse_out_header)];
+};
+\end{lstlisting}
+
+The FUSE protocol documented in \hyperref[intro:FUSE]{FUSE} specifies the set
+of request types and their contents.  The endianness of the FUSE protocol
+session is detectable by inspecting the uint32_t \field{in.opcode} field of the
+first request sent by the driver to the device.  The first message of any
+session is FUSE_INIT and this allows the device to determine whether the
+session is little-endian or big-endian.
+
+\subsubsection{Device Operation: High Priority Queue}\label{sec:Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The hiprio queue follows the same request format as the request queues.  This
+queue only contains FUSE_INTERRUPT, FUSE_FORGET, and FUSE_BATCH_FORGET
+requests.
+
+Interrupt and forget requests have a higher priority than normal requests.  In
+order to ensure that they can always be delivered, even if all request queues
+are full, a separate queue is used.
+
+\devicenormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The device SHOULD attempt to process the hiprio queue promptly.
+
+The device MAY process request queues concurrently with the hiprio queue.
+
+\drivernormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The driver MUST submit FUSE_INTERRUPT, FUSE_FORGET, and FUSE_BATCH_FORGET requests solely on the hiprio queue.
+
+The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.
+
+\subsubsection{Security Considerations}\label{sec:Device Types / File System Device / Security Considerations}
+
+The device provides access to a file system that may contain files owned by
+different POSIX user ids and group ids.  The device has no secure way of
+differentiating between users originating requests via the driver.  Therefore
+the device accepts the POSIX user ids and group ids provided by the driver and
+security is enforced by the driver rather than the device.  It is nevertheless
+possible for devices to implement POSIX user id and group id mapping or
+whitelisting to control the ownership and access available to the driver.
+
+The file system may contain special files including device nodes and setuid
+executable files.  These properties are defined by the file type and mode,
+which may be set by the driver when creating new files or changed at a later
+time.  These special files present a security risk when the file system is
+shared with another system, such as the host or another guest.  This issue can
+be solved on some operating systems using mount options that ignore special
+files.  It is also possible for devices to implement restrictions on special
+files by refusing their creation.
+
+When the device provides shared access to a file system the possibility of
+symlink race conditions, exhausting file system capacity, and overwriting or
+deleting files used by others must be taken into account.  These issues have a
+long history in multi-user operating systems and should not be overlooked with
+virtio devices.
+
+\subsubsection{Live Migration Considerations}\label{sec:Device Types / File System Device / Live Migration Considerations}
+
+When a guest is migrated to a new host it is necessary to consider the FUSE
+session and its state.  The continuity of FUSE inode numbers (also known as
+nodeids) and fh values is necessary so the driver can continue operation
+without disruption.  No such state exists before a FUSE session has been
+started with FUSE_INIT, so migration at that point is trivial.
+
+It is possible to maintain the FUSE session across live migration either by
+transferring the state or by redirecting requests from the new host to the old
+host where the state resides.  The details of how to achieve this are
+implementation-dependent and are not visible at the device interface level.
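
The endianness detection described under "Device Operation: Request
Queues" could be sketched on the device side as follows.  This is only an
illustration, not part of the patch: the enum and function names are
hypothetical, and the code assumes the fuse_in_header layout from
include/uapi/linux/fuse.h, where the 32-bit opcode follows the 32-bit len
field, and FUSE_INIT has the value 26.

```c
#include <stdint.h>

#define FUSE_INIT 26u  /* opcode value from include/uapi/linux/fuse.h */

/* Hypothetical result type for this sketch. */
enum fuse_session_endian {
        FUSE_ENDIAN_UNKNOWN,
        FUSE_ENDIAN_LITTLE,
        FUSE_ENDIAN_BIG,
};

/*
 * hdr points at the fuse_in_header of the first request of a session.
 * The opcode is read as both little-endian and big-endian; whichever
 * interpretation yields FUSE_INIT identifies the session endianness.
 */
static enum fuse_session_endian fuse_detect_endian(const uint8_t *hdr)
{
        const uint8_t *p = hdr + 4;  /* skip the uint32_t len field */
        uint32_t le = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                      ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        uint32_t be = (uint32_t)p[3] | ((uint32_t)p[2] << 8) |
                      ((uint32_t)p[1] << 16) | ((uint32_t)p[0] << 24);

        if (le == FUSE_INIT)
                return FUSE_ENDIAN_LITTLE;
        if (be == FUSE_INIT)
                return FUSE_ENDIAN_BIG;
        /* The first request was not FUSE_INIT in either byte order. */
        return FUSE_ENDIAN_UNKNOWN;
}
```

The two interpretations cannot both match, because 26 byte-swapped as a
32-bit value is not 26.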
-- 
2.20.1



* [virtio-dev] [PATCH v4 2/2] virtio-fs: add DAX window
  2019-03-05 16:12 [virtio-dev] [PATCH v4 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
  2019-03-05 16:12 ` [virtio-dev] [PATCH v4 1/2] content: " Stefan Hajnoczi
@ 2019-03-05 16:12 ` Stefan Hajnoczi
  1 sibling, 0 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2019-03-05 16:12 UTC
  To: virtio-dev
  Cc: Vivek Goyal, Miklos Szeredi, Dr. David Alan Gilbert,
	Steven Whitehouse, Sage Weil, Paolo Bonzini, Stefan Hajnoczi

Describe how shared memory region ID 0 is the DAX window and how
FUSE_SETUPMAPPING maps file ranges into the window.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
Note that this depends on the shared memory resource specification
extension that David Gilbert is working on.
https://lists.oasis-open.org/archives/virtio-comment/201901/msg00000.html

The FUSE_SETUPMAPPING message is part of the virtio-fs Linux patches:
https://gitlab.com/virtio-fs/linux/blob/virtio-fs/include/uapi/linux/fuse.h
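
For reviewers, the requirement that the device reject mappings going beyond
the end of the DAX window might be checked along these lines.  This is a
minimal sketch under assumptions: the function and parameter names are
hypothetical, and how the window offset and length are carried in
FUSE_SETUPMAPPING is defined by the virtio-fs Linux patches, not here.  The
comparison is arranged so that offset + len cannot wrap around.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: return true if the byte range
 * [window_offset, window_offset + len) lies entirely inside a DAX window
 * of window_size bytes.  Overlap with existing mappings is deliberately
 * not checked, since the device must allow overlapping mappings.
 */
static bool dax_mapping_fits(uint64_t window_size, uint64_t window_offset,
                             uint64_t len)
{
        if (window_offset > window_size)
                return false;
        /* Subtract instead of adding to avoid uint64_t overflow. */
        return len <= window_size - window_offset;
}
```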
---
 virtio-fs.tex | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/virtio-fs.tex b/virtio-fs.tex
index 984f14f..010d352 100644
--- a/virtio-fs.tex
+++ b/virtio-fs.tex
@@ -162,6 +162,31 @@ The driver MUST submit FUSE_INTERRUPT, FUSE_FORGET, and FUSE_BATCH_FORGET reques
 
 The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.
 
+\subsubsection{Device Operation: DAX Window}\label{sec:Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+FUSE\_READ and FUSE\_WRITE requests transfer file contents between the
+driver-provided buffer and the device.  In cases where data transfer is
+undesirable, the device can map file contents into the DAX window shared memory
+region.  The driver then accesses file contents directly in device-owned memory
+without a data transfer.
+
+Shared memory region ID 0 is called the DAX window.  The driver maps a file
+range into the DAX window using the FUSE\_SETUPMAPPING request.  The mapping is
+removed using the FUSE\_REMOVEMAPPING request.
+
+After FUSE\_SETUPMAPPING has completed successfully the file range is accessible
+from the DAX window at the offset provided by the driver in the request.
+
+\devicenormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+The device MUST allow mappings that completely or partially overlap existing mappings within the DAX window.
+
+The device MUST reject mappings that would go beyond the end of the DAX window.
+
+\drivernormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+The driver SHOULD be prepared to find shared memory region ID 0 absent and fall back to FUSE\_READ and FUSE\_WRITE requests.
+
 \subsubsection{Security Considerations}\label{sec:Device Types / File System Device / Security Considerations}
 
 The device provides access to a file system that may contain files owned by
-- 
2.20.1

