All of lore.kernel.org
 help / color / mirror / Atom feed
* [virtio-dev] [PATCH v5 0/2] virtio-fs: add virtio file system device
@ 2019-07-26  9:53 Stefan Hajnoczi
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 1/2] content: " Stefan Hajnoczi
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-07-26  9:53 UTC (permalink / raw)
  To: virtio-dev
  Cc: Sage Weil, Steven Whitehouse, Miklos Szeredi, Michael S. Tsirkin,
	Vivek Goyal, Dr. David Alan Gilbert, Stefan Hajnoczi

v5:
 * Explain multiqueue semantics: no ordering, identical functionality on each queue, one FUSE session state shared between all queues
 * Explain how the FUSE session is started with a FUSE_INIT request
 * Consistently use "submit" vs "made available" and "complete" vs "used" terminology [Michael]
 * Explain endianness [Michael]
 * Clarify hiprio vs normal queue usage [Michael]
 * Move SHOULD, MUST, etc wording into normative sections [Michael]
 * Mention that FUSE_INIT negotiated state needs to be transferred during live migration [Michael]
 * State that the DAX window is mapped with writeback caching like RAM [Michael]
 * Mention DAX window mapping alignment constraints (they are communicated via the FUSE protocol) [Michael]
 * Explain that FUSE_SETUPMAPPING fails when device resources are exhausted and that splitting mappings consumes resources too [Michael]
 * Clarify access rules to DAX window - only touch memory that has a mapping establised
 * Document that DAX data persistence is achieved via FUSE_FSYNC

v4:
 * Clarify that there are no request ordering guarantees between requests in a
   single queue [vgoyal]
 * Add explanation of FUSE session endianness detection [dgilbert]

v3:
 * Remove notifications virtqueue, it's unimplemented and can be added when
   needed [Miklos]
 * Add Security Considerations and Live Migration Considerations sections
   [Michael]
v2:
 * Clean up core virtio file system device spec
 * Add DAX window

These patches add the virtio file system device, which is based on Linux FUSE
but includes the DAX window extension.  Similar to virtio-scsi, which
transports SCSI commands, virtio-fs transports FUSE requests and the protocol
documentation is not duplicated here.

The DAX window allows file contents to be accessed directly from shared memory.
This eliminates copying of data, reduces the number of vmexits, and reduces the
guest's memory footprint.  It also allows coherent mmap MAP_SHARED semantics
between guests on the same host.

Stefan Hajnoczi (2):
  content: add virtio file system device
  virtio-fs: add DAX window

 content.tex      |   1 +
 introduction.tex |   3 +
 virtio-fs.tex    | 259 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 263 insertions(+)
 create mode 100644 virtio-fs.tex

-- 
2.21.0


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-07-26  9:53 [virtio-dev] [PATCH v5 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
@ 2019-07-26  9:53 ` Stefan Hajnoczi
  2019-07-31  9:58   ` Cornelia Huck
                     ` (2 more replies)
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 2/2] virtio-fs: add DAX window Stefan Hajnoczi
  2019-07-26 12:27 ` [virtio-dev] Re: [PATCH v5 0/2] virtio-fs: add virtio file system device Michael S. Tsirkin
  2 siblings, 3 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-07-26  9:53 UTC (permalink / raw)
  To: virtio-dev
  Cc: Sage Weil, Steven Whitehouse, Miklos Szeredi, Michael S. Tsirkin,
	Vivek Goyal, Dr. David Alan Gilbert, Stefan Hajnoczi

The virtio file system device transports Linux FUSE requests between a
FUSE daemon running on the host and the FUSE driver inside the guest.

The actual FUSE request definitions are not duplicated in the virtio
specification, similar to how virtio-scsi does not document SCSI
command details.  FUSE request definitions are available here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h

This patch documents the core virtio file system device, which is
functional but lacks the DAX feature introduced in the next patch.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 content.tex      |   1 +
 introduction.tex |   3 +
 virtio-fs.tex    | 214 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 218 insertions(+)
 create mode 100644 virtio-fs.tex

diff --git a/content.tex b/content.tex
index ee0d7c9..c14e953 100644
--- a/content.tex
+++ b/content.tex
@@ -5677,6 +5677,7 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
 \input{virtio-input.tex}
 \input{virtio-crypto.tex}
 \input{virtio-vsock.tex}
+\input{virtio-fs.tex}
 
 \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 
diff --git a/introduction.tex b/introduction.tex
index c96acf9..40f16f8 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -60,6 +60,9 @@ \section{Normative References}\label{sec:Normative References}
 	\phantomsection\label{intro:SCSI MMC}\textbf{[SCSI MMC]} &
         SCSI Multimedia Commands,
         \newline\url{http://www.t10.org/cgi-bin/ac.pl?t=f&f=mmc6r00.pdf}\\
+	\phantomsection\label{intro:FUSE}\textbf{[FUSE]} &
+	Linux FUSE interface,
+	\newline\url{https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h}\\
 
 \end{longtable}
 
diff --git a/virtio-fs.tex b/virtio-fs.tex
new file mode 100644
index 0000000..a9f47ce
--- /dev/null
+++ b/virtio-fs.tex
@@ -0,0 +1,214 @@
+\section{File System Device}\label{sec:Device Types / File System Device}
+
+The virtio file system device provides file system access.  The device either
+directly manages a file system or it acts as a gateway to a remote file system.
+The details of how the device implementation accesses files are hidden by the
+device interface, allowing for a range of use cases.
+
+Unlike block-level storage devices such as virtio block and SCSI, the virtio
+file system device provides file-level access to data.  The device interface is
+based on the Linux Filesystem in Userspace (FUSE) protocol.  This consists of
+requests for file system traversal and access the files and directories within
+it.  The protocol details are defined by \hyperref[intro:FUSE]{FUSE}.
+
+The device acts as the FUSE file system daemon and the driver acts as the FUSE
+client mounting the file system.  The virtio file system device provides the
+mechanism for transporting FUSE requests, much like /dev/fuse in a traditional
+FUSE application.
+
+This section relies on definitions from \hyperref[intro:FUSE]{FUSE}.
+
+\subsection{Device ID}\label{sec:Device Types / File System Device / Device ID}
+  26
+
+\subsection{Virtqueues}\label{sec:Device Types / File System Device / Virtqueues}
+
+\begin{description}
+\item[0] hiprio
+\item[1\ldots n] request queues
+\end{description}
+
+\subsection{Feature bits}\label{sec:Device Types / File System Device / Feature bits}
+
+There are currently no feature bits defined.
+
+\subsection{Device configuration layout}\label{sec:Device Types / File System Device / Device configuration layout}
+
+All fields of this configuration are always available.
+
+\begin{lstlisting}
+struct virtio_fs_config {
+        char tag[36];
+        le32 num_queues;
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{tag}] is the name associated with this file system.  The tag is
+    encoded in UTF-8 and padded with NUL bytes if shorter than the
+    available space.  This field is not NUL-terminated if the encoded bytes
+    take up the entire field.
+\item[\field{num_queues}] is the total number of request virtqueues exposed by
+    the device.  Each virtqueue offers identical functionality and there are no
+    ordering guarantees between requests made available on different queues.
+    Use of multiple queues is intended to increase performance.
+\end{description}
+
+\drivernormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
+
+The driver MUST NOT write to device configuration fields.
+
+The driver MAY use from one up to \field{num_queues} virtqueues.
+
+\devicenormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
+
+The device MUST set \field{num_queues} to 1 or greater.
+
+\subsection{Device Initialization}\label{Device Types / File System Device / Device Initialization}
+
+On initialization the driver first discovers the device's virtqueues.  The FUSE
+session is started by sending a FUSE\_INIT request as defined by the FUSE
+protocol on one request virtqueue.  All virtqueues provide access to the same
+FUSE session and therefore only one FUSE\_INIT request is required regardless
+of the number of available virtqueues.
+
+\subsection{Device Operation}\label{sec:Device Types / File System Device / Device Operation}
+
+Device operation consists of operating the virtqueues to facilitate file system
+access.
+
+The FUSE request types are as follows:
+\begin{itemize}
+\item Normal requests are made available by the driver and used by the device.
+\item Interrupt requests are made available by the driver to abort requests
+      that the device has not used yet.
+\end{itemize}
+
+Note that FUSE notification requests are not supported.
+
+\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Request Queues}
+
+The driver enqueues normal requests on an arbitrary request queue and they are
+completed by the device on that same queue. The device processes requests in
+any order.  The driver is responsible for ensuring that ordering constraints
+are met by making available a dependent request only after its prerequisite
+request has been used.
+
+Requests have the following format:
+
+\begin{lstlisting}
+struct virtio_fs_req {
+        // Device-readable part
+        struct fuse_in_header in;
+        u8 datain[];
+
+        // Device-writable part
+        struct fuse_out_header out;
+        u8 dataout[];
+};
+\end{lstlisting}
+
+Note that the words "in" and "out" follow the FUSE meaning and do not indicate
+the direction of data transfer under VIRTIO.  "In" means input to a request and
+"out" means output from processing a request.
+
+\field{in} is the common header for all types of FUSE requests.
+
+\field{datain} consists of request-specific data, if any.  This is identical to
+the data read from the /dev/fuse device by a FUSE daemon.
+
+\field{out} is the completion header common to all types of FUSE requests.
+
+\field{dataout} consists of request-specific data, if any.  This is identical
+to the data written to the /dev/fuse device by a FUSE daemon.
+
+For example, the full layout of a FUSE\_READ request is as follows:
+
+\begin{lstlisting}
+struct virtio_fs_read_req {
+        // Device-readable part
+        struct fuse_in_header in;
+        union {
+                struct fuse_read_in readin;
+                u8 datain[sizeof(struct fuse_read_in)];
+        };
+
+        // Device-writable part
+        struct fuse_out_header out;
+        u8 dataout[out.len - sizeof(struct fuse_out_header)];
+};
+\end{lstlisting}
+
+The FUSE protocol documented in \hyperref[intro:FUSE]{FUSE} specifies the set
+of request types and their contents.
+
+The endianness of the FUSE protocol session is detectable by inspecting the
+uint32\_t \field{in.opcode} field of the FUSE\_INIT request sent by the driver
+to the device.  This allows the device to determine whether the session is
+little-endian or big-endian.
+
+\subsubsection{Device Operation: High Priority Queue}\label{sec:Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The hiprio queue follows the same request format as the request queues.  This
+queue only contains FUSE\_INTERRUPT, FUSE\_FORGET, and FUSE\_BATCH\_FORGET
+requests.
+
+Interrupt and forget requests have a higher priority than normal requests.  The
+separate hiprio queue is used for these requests to ensure they can be
+delivered even when all request queues are full.
+
+\devicenormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The device MUST NOT pause processing of the hiprio queue due to activity on a
+normal request queue.
+
+The device MAY process request queues concurrently with the hiprio queue.
+
+\drivernormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The driver MUST submit FUSE\_INTERRUPT, FUSE\_FORGET, and FUSE\_BATCH\_FORGET requests solely on the hiprio queue.
+
+The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.
+
+\subsubsection{Security Considerations}\label{sec:Device Types / File System Device / Security Considerations}
+
+The device provides access to a file system containing files owned by one or
+more POSIX user ids and group ids.  The device has no secure way of
+differentiating between users originating requests via the driver.  Therefore
+the device accepts the POSIX user ids and group ids provided by the driver and
+security is enforced by the driver rather than the device.  It is nevertheless
+possible for devices to implement POSIX user id and group id mapping or
+whitelisting to control the ownership and access available to the driver.
+
+File systems containing special files including device nodes and setuid
+executable files pose a security concern.  These properties are defined by the
+file type and mode, which are set by the driver when creating new files or by
+changes at a later time.  These special files present a security risk when the
+file system is shared with another system, such as the host or another guest.
+This issue can be solved on some operating systems using mount options that
+ignore special files.  It is also possible for devices to implement
+restrictions on special files by refusing their creation.
+
+When the device provides shared access to a file system, symlink race
+conditions, exhausting file system capacity, and overwriting or deleting files
+used by others are factors to consider.  These issues have a long history in
+multi-user operating systems and also apply to virtio-fs.  They are typically
+managed at the file system administration level by providing shared access only
+to mutually trusted users.
+
+\subsubsection{Live migration considerations}\label{sec:Device Types / File System Device / Live Migration Considerations}
+
+When a guest is migrated to a new host it is necessary to consider the FUSE
+session and its state.  The continuity of FUSE inode numbers (also known as
+nodeids) and fh values is necessary so the driver can continue operation
+without disruption.
+
+It is possible to maintain the FUSE session across live migration either by
+transferring the state or by redirecting requests from the new host to the old
+host where the state resides.  The details of how to achieve this are
+implementation-dependent and are not visible at the device interface level.
+
+Maintaining version and feature information negotiated by FUSE\_INIT is
+necessary so that no FUSE protocol feature changes are visible to the driver
+across live migration.  The FUSE\_INIT information forms part of the FUSE
+session state that needs to be transferred during live migration.
-- 
2.21.0


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [virtio-dev] [PATCH v5 2/2] virtio-fs: add DAX window
  2019-07-26  9:53 [virtio-dev] [PATCH v5 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 1/2] content: " Stefan Hajnoczi
@ 2019-07-26  9:53 ` Stefan Hajnoczi
  2019-07-26 11:41   ` [virtio-dev] " Michael S. Tsirkin
  2019-07-26 12:27 ` [virtio-dev] Re: [PATCH v5 0/2] virtio-fs: add virtio file system device Michael S. Tsirkin
  2 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-07-26  9:53 UTC (permalink / raw)
  To: virtio-dev
  Cc: Sage Weil, Steven Whitehouse, Miklos Szeredi, Michael S. Tsirkin,
	Vivek Goyal, Dr. David Alan Gilbert, Stefan Hajnoczi

Describe how shared memory region ID 0 is the DAX window and how
FUSE_SETUPMAPPING maps file ranges into the window.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
Note that this depends on the shared memory resource specification
extension that David Gilbert is working on.
https://lists.oasis-open.org/archives/virtio-comment/201901/msg00000.html

The FUSE_SETUPMAPPING message is part of the virtio-fs Linux patches:
https://gitlab.com/virtio-fs/linux/blob/virtio-fs/include/uapi/linux/fuse.h
---
 virtio-fs.tex | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/virtio-fs.tex b/virtio-fs.tex
index a9f47ce..2d01a51 100644
--- a/virtio-fs.tex
+++ b/virtio-fs.tex
@@ -170,6 +170,51 @@ \subsubsection{Device Operation: High Priority Queue}\label{sec:Device Types / F
 
 The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.
 
+\subsubsection{Device Operation: DAX Window}\label{sec:Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+FUSE\_READ and FUSE\_WRITE requests transfer file contents between the
+driver-provided buffer and the device.  In cases where data transfer is
+undesirable, the device can map file contents into the DAX window shared memory
+region.  The driver then accesses file contents directly in device-owned memory
+without a data transfer.
+
+Shared memory region ID 0 is called the DAX window.  Drivers map this shared
+memory region with writeback caching as if it were regular RAM.  The contents
+of the DAX window are undefined unless a mapping exists for that range.
+
+The driver maps a file range into the DAX window using the FUSE\_SETUPMAPPING
+request.  Alignment constraints for FUSE\_SETUPMAPPING and FUSE\_REMOVEMAPPING
+requests are communicated during FUSE\_INIT negotiation.
+
+When a FUSE\_SETUPMAPPING request perfectly overlaps a previous mapping, the
+previous mapping is replaced.  When a mapping partially overlaps a previous
+mapping, the previous mapping is split into one or two smaller mappings.  When
+a mapping is partially unmapped it is also split into one or two smaller
+mappings.
+
+Establishing new mappings or splitting existing mappings consumes resources.
+If the device runs out of resources the FUSE\_SETUPMAPPING request fails until
+resources are available again following FUSE\_REMOVEMAPPING.
+
+After FUSE\_SETUPMAPPING has completed successfully the file range is
+accessible from the DAX window at the offset provided by the driver in the
+request.  A mapping is removed using the FUSE\_REMOVEMAPPING request.
+
+Data is only guaranteed to be persistent when a FUSE\_FSYNC request is used by
+the device after having been made available by the driver following the write.
+
+\devicenormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+The device MUST allow mappings that completely or partially overlap existing mappings within the DAX window.
+
+The device MUST reject mappings that would go beyond the end of the DAX window.
+
+\drivernormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+The driver SHOULD be prepared to find shared memory region ID 0 absent and fall back to FUSE\_READ and FUSE\_WRITE requests.
+
+The driver MUST NOT access DAX window areas that have not been mapped.
+
 \subsubsection{Security Considerations}\label{sec:Device Types / File System Device / Security Considerations}
 
 The device provides access to a file system containing files owned by one or
-- 
2.21.0


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [virtio-dev] Re: [PATCH v5 2/2] virtio-fs: add DAX window
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 2/2] virtio-fs: add DAX window Stefan Hajnoczi
@ 2019-07-26 11:41   ` Michael S. Tsirkin
  2019-08-12 13:55     ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-07-26 11:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

On Fri, Jul 26, 2019 at 10:53:38AM +0100, Stefan Hajnoczi wrote:
> Describe how shared memory region ID 0 is the DAX window and how
> FUSE_SETUPMAPPING maps file ranges into the window.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---

Are there any security considerations with this memory sharing?
If yes it might be a good idea to add a chapter about that,
and functionality/performance/security tradeoffs involved.

> Note that this depends on the shared memory resource specification
> extension that David Gilbert is working on.
> https://lists.oasis-open.org/archives/virtio-comment/201901/msg00000.html
> 
> The FUSE_SETUPMAPPING message is part of the virtio-fs Linux patches:
> https://gitlab.com/virtio-fs/linux/blob/virtio-fs/include/uapi/linux/fuse.h
> ---
>  virtio-fs.tex | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/virtio-fs.tex b/virtio-fs.tex
> index a9f47ce..2d01a51 100644
> --- a/virtio-fs.tex
> +++ b/virtio-fs.tex
> @@ -170,6 +170,51 @@ \subsubsection{Device Operation: High Priority Queue}\label{sec:Device Types / F
>  
>  The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.
>  
> +\subsubsection{Device Operation: DAX Window}\label{sec:Device Types / File System Device / Device Operation / Device Operation: DAX Window}
> +
> +FUSE\_READ and FUSE\_WRITE requests transfer file contents between the
> +driver-provided buffer and the device.  In cases where data transfer is
> +undesirable, the device can map file contents into the DAX window shared memory
> +region.  The driver then accesses file contents directly in device-owned memory
> +without a data transfer.
> +
> +Shared memory region ID 0 is called the DAX window.  Drivers map this shared
> +memory region with writeback caching as if it were regular RAM.  The contents
> +of the DAX window are undefined unless a mapping exists for that range.
> +
> +The driver maps a file range into the DAX window using the FUSE\_SETUPMAPPING
> +request.  Alignment constraints for FUSE\_SETUPMAPPING and FUSE\_REMOVEMAPPING
> +requests are communicated during FUSE\_INIT negotiation.
> +
> +When a FUSE\_SETUPMAPPING request perfectly overlaps a previous mapping, the
> +previous mapping is replaced.  When a mapping partially overlaps a previous
> +mapping, the previous mapping is split into one or two smaller mappings.  When
> +a mapping is partially unmapped it is also split into one or two smaller
> +mappings.
> +
> +Establishing new mappings or splitting existing mappings consumes resources.
> +If the device runs out of resources the FUSE\_SETUPMAPPING request fails until
> +resources are available again following FUSE\_REMOVEMAPPING.
> +
> +After FUSE\_SETUPMAPPING has completed successfully the file range is
> +accessible from the DAX window at the offset provided by the driver in the
> +request.  A mapping is removed using the FUSE\_REMOVEMAPPING request.
> +
> +Data is only guaranteed to be persistent when a FUSE\_FSYNC request is used by
> +the device after having been made available by the driver following the write.
> +
> +\devicenormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
> +
> +The device MUST allow mappings that completely or partially overlap existing mappings within the DAX window.
> +
> +The device MUST reject mappings that would go beyond the end of the DAX window.
> +
> +\drivernormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
> +
> +The driver SHOULD be prepared to find shared memory region ID 0 absent and fall back to FUSE\_READ and FUSE\_WRITE requests.
> +
> +The driver MUST NOT access DAX window areas that have not been mapped.
> +
>  \subsubsection{Security Considerations}\label{sec:Device Types / File System Device / Security Considerations}
>  
>  The device provides access to a file system containing files owned by one or
> -- 
> 2.21.0

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [virtio-dev] Re: [PATCH v5 0/2] virtio-fs: add virtio file system device
  2019-07-26  9:53 [virtio-dev] [PATCH v5 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 1/2] content: " Stefan Hajnoczi
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 2/2] virtio-fs: add DAX window Stefan Hajnoczi
@ 2019-07-26 12:27 ` Michael S. Tsirkin
  2019-07-26 12:34   ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-07-26 12:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

Didn't look at the spec yet, but I have a question:

On Fri, Jul 26, 2019 at 10:53:36AM +0100, Stefan Hajnoczi wrote:
> v5:
>  * Explain multiqueue semantics: no ordering, identical functionality on each queue, one FUSE session state shared between all queues
>  * Explain how the FUSE session is started with a FUSE_INIT request
>  * Consistently use "submit" vs "made available" and "complete" vs "used" terminology [Michael]
>  * Explain endianness [Michael]
>  * Clarify hiprio vs normal queue usage [Michael]
>  * Move SHOULD, MUST, etc wording into normative sections [Michael]
>  * Mention that FUSE_INIT negotiated state needs to be transferred during live migration [Michael]
>  * State that the DAX window is mapped with writeback caching like RAM [Michael]
>  * Mention DAX window mapping alignment constraints (they are communicated via the FUSE protocol) [Michael]
>  * Explain that FUSE_SETUPMAPPING fails when device resources are exhausted and that splitting mappings consumes resources too [Michael]
>  * Clarify access rules to DAX window - only touch memory that has a mapping establised
>  * Document that DAX data persistence is achieved via FUSE_FSYNC


The following was raised during v4 review:

	> > > > There is a practical problem that the QEMU process may hit the mmap
	> > > > limit and be unable to perform its own mmaps due to the DAX Window.  A
	> > > > limit must be enforced on the host so that QEMU's internal mmaps
	> > > > succeed.
	> > >
	> > > But number of mmaps are already limited by dax window size which is
	> > > controlled by virtiofsd. So all user has to do is start with smaller
	> > > dax window size if this ever becomes a concern.
	> >
	> > Yes, the DAX Window size sets a hard limit on the number of mappings
	> > (window_size / page_size).  I think we should still communicate a
	> > maximum number of mappings to provide more control for cases where the
	> > page size and DAX Window size don't produce a desirable number.
	> >
	> > Consider that window_size = 8 GB and page_size = 4 KB already yields
	> > 2,097,152 maximum mappings.  Oops, that number is too large!
	>
	> Ok, so that's something which will be in virtiofsd where maximum number
	> of outstanding can be configured by user and if we cross that limit,
	> FUSE_SETUPMAPPING will be returned with some error?

	Yes, virtiofsd will enforce the limit.

	The guest driver will also be aware of the limit via VIRTIO
	configuration space.

was this addressed?


> v4:
>  * Clarify that there are no request ordering guarantees between requests in a
>    single queue [vgoyal]
>  * Add explanation of FUSE session endianness detection [dgilbert]
> 
> v3:
>  * Remove notifications virtqueue, it's unimplemented and can be added when
>    needed [Miklos]
>  * Add Security Considerations and Live Migration Considerations sections
>    [Michael]
> v2:
>  * Clean up core virtio file system device spec
>  * Add DAX window
> 
> These patches add the virtio file system device, which is based on Linux FUSE
> but includes the DAX window extension.  Similar to virtio-scsi, which
> transports SCSI commands, virtio-fs transports FUSE requests and the protocol
> documentation is not duplicated here.
> 
> The DAX window allows file contents to be accessed directly from shared memory.
> This eliminates copying of data, reduces the number of vmexits, and reduces the
> guest's memory footprint.  It also allows coherent mmap MAP_SHARED semantics
> between guests on the same host.
> 
> Stefan Hajnoczi (2):
>   content: add virtio file system device
>   virtio-fs: add DAX window
> 
>  content.tex      |   1 +
>  introduction.tex |   3 +
>  virtio-fs.tex    | 259 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 263 insertions(+)
>  create mode 100644 virtio-fs.tex
> 
> -- 
> 2.21.0

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [virtio-dev] Re: [PATCH v5 0/2] virtio-fs: add virtio file system device
  2019-07-26 12:27 ` [virtio-dev] Re: [PATCH v5 0/2] virtio-fs: add virtio file system device Michael S. Tsirkin
@ 2019-07-26 12:34   ` Dr. David Alan Gilbert
  2019-08-12 13:56     ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2019-07-26 12:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, virtio-dev, Sage Weil, Steven Whitehouse,
	Miklos Szeredi, Vivek Goyal

* Michael S. Tsirkin (mst@redhat.com) wrote:
> Didn't look at the spec yet, but I have a question:
> 
> On Fri, Jul 26, 2019 at 10:53:36AM +0100, Stefan Hajnoczi wrote:
> > v5:
> >  * Explain multiqueue semantics: no ordering, identical functionality on each queue, one FUSE session state shared between all queues
> >  * Explain how the FUSE session is started with a FUSE_INIT request
> >  * Consistently use "submit" vs "made available" and "complete" vs "used" terminology [Michael]
> >  * Explain endianness [Michael]
> >  * Clarify hiprio vs normal queue usage [Michael]
> >  * Move SHOULD, MUST, etc wording into normative sections [Michael]
> >  * Mention that FUSE_INIT negotiated state needs to be transferred during live migration [Michael]
> >  * State that the DAX window is mapped with writeback caching like RAM [Michael]
> >  * Mention DAX window mapping alignment constraints (they are communicated via the FUSE protocol) [Michael]
> >  * Explain that FUSE_SETUPMAPPING fails when device resources are exhausted and that splitting mappings consumes resources too [Michael]
> >  * Clarify access rules to DAX window - only touch memory that has a mapping establised
> >  * Document that DAX data persistence is achieved via FUSE_FSYNC
> 
> 
> The following was raised during v4 review:
> 
> 	> > > > There is a practical problem that the QEMU process may hit the mmap
> 	> > > > limit and be unable to perform its own mmaps due to the DAX Window.  A
> 	> > > > limit must be enforced on the host so that QEMU's internal mmaps
> 	> > > > succeed.
> 	> > >
> 	> > > But number of mmaps are already limited by dax window size which is
> 	> > > controlled by virtiofsd. So all user has to do is start with smaller
> 	> > > dax window size if this ever becomes a concern.
> 	> >
> 	> > Yes, the DAX Window size sets a hard limit on the number of mappings
> 	> > (window_size / page_size).  I think we should still communicate a
> 	> > maximum number of mappings to provide more control for cases where the
> 	> > page size and DAX Window size don't produce a desirable number.
> 	> >
> 	> > Consider that window_size = 8 GB and page_size = 4 KB already yields
> 	> > 2,097,152 maximum mappings.  Oops, that number is too large!
> 	>
> 	> Ok, so that's something which will be in virtiofsd where maximum number
> 	> of outstanding can be configured by user and if we cross that limit,
> 	> FUSE_SETUPMAPPING will be returned with some error?
> 
> 	Yes, virtiofsd will enforce the limit.
> 
> 	The guest driver will also be aware of the limit via VIRTIO
> 	configuration space.
> 
> was this addressed?

In this version Stefan has marked SETUPMAPPING to say it can fail if out
of resources.

The tricky problem with specifying a limit is that the limit is more
likely to be global than to be per-device, and so specifying it in
either the virtio config space or (as looked more appropriate) fuse init
doesn't feel right.

IMHO we should never hit the limit in practical use, so it's really just
a matter of saying that it can fail and setting a large enough limit
that non-abusive drivers dont cause a problem.

Dave


> 
> > v4:
> >  * Clarify that there are no request ordering guarantees between requests in a
> >    single queue [vgoyal]
> >  * Add explanation of FUSE session endianness detection [dgilbert]
> > 
> > v3:
> >  * Remove notifications virtqueue, it's unimplemented and can be added when
> >    needed [Miklos]
> >  * Add Security Considerations and Live Migration Considerations sections
> >    [Michael]
> > v2:
> >  * Clean up core virtio file system device spec
> >  * Add DAX window
> > 
> > These patches add the virtio file system device, which is based on Linux FUSE
> > but includes the DAX window extension.  Similar to virtio-scsi, which
> > transports SCSI commands, virtio-fs transports FUSE requests and the protocol
> > documentation is not duplicated here.
> > 
> > The DAX window allows file contents to be accessed directly from shared memory.
> > This eliminates copying of data, reduces the number of vmexits, and reduces the
> > guest's memory footprint.  It also allows coherent mmap MAP_SHARED semantics
> > between guests on the same host.
> > 
> > Stefan Hajnoczi (2):
> >   content: add virtio file system device
> >   virtio-fs: add DAX window
> > 
> >  content.tex      |   1 +
> >  introduction.tex |   3 +
> >  virtio-fs.tex    | 259 +++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 263 insertions(+)
> >  create mode 100644 virtio-fs.tex
> > 
> > -- 
> > 2.21.0
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 1/2] content: " Stefan Hajnoczi
@ 2019-07-31  9:58   ` Cornelia Huck
  2019-08-02 15:49     ` Stefan Hajnoczi
  2019-08-03 21:12   ` Michael S. Tsirkin
  2019-08-03 21:21   ` Michael S. Tsirkin
  2 siblings, 1 reply; 18+ messages in thread
From: Cornelia Huck @ 2019-07-31  9:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Michael S. Tsirkin, Vivek Goyal, Dr. David Alan Gilbert

On Fri, 26 Jul 2019 10:53:37 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> The virtio file system device transports Linux FUSE requests between a
> FUSE daemon running on the host and the FUSE driver inside the guest.
> 
> The actual FUSE request definitions are not duplicated in the virtio
> specification, similar to how virtio-scsi does not document SCSI
> command details.  FUSE request definitions are available here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h
> 
> This patch documents the core virtio file system device, which is
> functional but lacks the DAX feature introduced in the next patch.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  content.tex      |   1 +
>  introduction.tex |   3 +
>  virtio-fs.tex    | 214 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 218 insertions(+)
>  create mode 100644 virtio-fs.tex

(...)

> +\subsection{Virtqueues}\label{sec:Device Types / File System Device / Virtqueues}
> +
> +\begin{description}
> +\item[0] hiprio
> +\item[1\ldots n] request queues
> +\end{description}
> +
> +\subsection{Feature bits}\label{sec:Device Types / File System Device / Feature bits}
> +
> +There are currently no feature bits defined.
> +
> +\subsection{Device configuration layout}\label{sec:Device Types / File System Device / Device configuration layout}
> +
> +All fields of this configuration are always available.
> +
> +\begin{lstlisting}
> +struct virtio_fs_config {
> +        char tag[36];
> +        le32 num_queues;
> +};
> +\end{lstlisting}
> +
> +\begin{description}
> +\item[\field{tag}] is the name associated with this file system.  The tag is
> +    encoded in UTF-8 and padded with NUL bytes if shorter than the
> +    available space.  This field is not NUL-terminated if the encoded bytes
> +    take up the entire field.
> +\item[\field{num_queues}] is the total number of request virtqueues exposed by
> +    the device.  Each virtqueue offers identical functionality and there are no
> +    ordering guarantees between requests made available on different queues.
> +    Use of multiple queues is intended to increase performance.
> +\end{description}
> +
> +\drivernormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
> +
> +The driver MUST NOT write to device configuration fields.
> +
> +The driver MAY use from one up to \field{num_queues} virtqueues.

s/virtqueues/request virtqueues/ ? I'm still a bit unsure about the
requirements for queue usage, see below.

> +
> +\devicenormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
> +
> +The device MUST set \field{num_queues} to 1 or greater.
> +
> +\subsection{Device Initialization}\label{Device Types / File System Device / Device Initialization}
> +
> +On initialization the driver first discovers the device's virtqueues.  The FUSE
> +session is started by sending a FUSE\_INIT request as defined by the FUSE
> +protocol on one request virtqueue.  All virtqueues provide access to the same
> +FUSE session and therefore only one FUSE\_INIT request is required regardless
> +of the number of available virtqueues.
> +
> +\subsection{Device Operation}\label{sec:Device Types / File System Device / Device Operation}
> +
> +Device operation consists of operating the virtqueues to facilitate file system
> +access.
> +
> +The FUSE request types are as follows:
> +\begin{itemize}
> +\item Normal requests are made available by the driver and used by the device.

These go via the normal request queues...

> +\item Interrupt requests are made available by the driver to abort requests
> +      that the device has not used yet.

...and these via the hiprio queue?

> +\end{itemize}
> +
> +Note that FUSE notification requests are not supported.
> +
> +\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Request Queues}
> +
> +The driver enqueues normal requests on an arbitrary request queue and they are
> +completed by the device on that same queue.

Do we need a device normative statement that requests MUST be completed
on the queue they have been submitted on?

> The device processes requests in
> +any order.  The driver is responsible for ensuring that ordering constraints
> +are met by making available a dependent request only after its prerequisite
> +request has been used.
> +
> +Requests have the following format:

These can be either LE or BE from what is stated below, right? Maybe
spell it out here already?

> +
> +\begin{lstlisting}
> +struct virtio_fs_req {
> +        // Device-readable part
> +        struct fuse_in_header in;
> +        u8 datain[];
> +
> +        // Device-writable part
> +        struct fuse_out_header out;
> +        u8 dataout[];
> +};
> +\end{lstlisting}
> +
> +Note that the words "in" and "out" follow the FUSE meaning and do not indicate
> +the direction of data transfer under VIRTIO.  "In" means input to a request and
> +"out" means output from processing a request.
> +
> +\field{in} is the common header for all types of FUSE requests.
> +
> +\field{datain} consists of request-specific data, if any.  This is identical to
> +the data read from the /dev/fuse device by a FUSE daemon.
> +
> +\field{out} is the completion header common to all types of FUSE requests.
> +
> +\field{dataout} consists of request-specific data, if any.  This is identical
> +to the data written to the /dev/fuse device by a FUSE daemon.
> +
> +For example, the full layout of a FUSE\_READ request is as follows:
> +
> +\begin{lstlisting}
> +struct virtio_fs_read_req {
> +        // Device-readable part
> +        struct fuse_in_header in;
> +        union {
> +                struct fuse_read_in readin;
> +                u8 datain[sizeof(struct fuse_read_in)];
> +        };
> +
> +        // Device-writable part
> +        struct fuse_out_header out;
> +        u8 dataout[out.len - sizeof(struct fuse_out_header)];
> +};
> +\end{lstlisting}
> +
> +The FUSE protocol documented in \hyperref[intro:FUSE]{FUSE} specifies the set
> +of request types and their contents.
> +
> +The endianness of the FUSE protocol session is detectable by inspecting the
> +uint32\_t \field{in.opcode} field of the FUSE\_INIT request sent by the driver
> +to the device.  This allows the device to determine whether the session is
> +little-endian or big-endian.

Do we need a driver normative statement that the driver MUST NOT send a
request with a different endianness once it established the session's
endianness via FUSE_INIT?

> +
> +\subsubsection{Device Operation: High Priority Queue}\label{sec:Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
> +
> +The hiprio queue follows the same request format as the request queues.  This
> +queue only contains FUSE\_INTERRUPT, FUSE\_FORGET, and FUSE\_BATCH\_FORGET
> +requests.

Does this also imply that normal request queues can't be used for
interrupt and forget requests? The normative statement below seems to
imply that, but I think the text here does not make it clear already.

> +
> +Interrupt and forget requests have a higher priority than normal requests.  The
> +separate hiprio queue is used for these requests to ensure they can be
> +delivered even when all request queues are full.
> +
> +\devicenormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
> +
> +The device MUST NOT pause processing of the hiprio queue due to activity on a
> +normal request queue.
> +
> +The device MAY process request queues concurrently with the hiprio queue.
> +
> +\drivernormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
> +
> +The driver MUST submit FUSE\_INTERRUPT, FUSE\_FORGET, and FUSE\_BATCH\_FORGET requests solely on the hiprio queue.

This specifies that the driver MUST submit interrupt and forget request
via the hiprio queue, but it does not specify that no other requests
may go via the hiprio queue, what the descriptive text above implies.

> +
> +The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.

(...)

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-07-31  9:58   ` Cornelia Huck
@ 2019-08-02 15:49     ` Stefan Hajnoczi
  2019-08-05 15:38       ` Cornelia Huck
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-08-02 15:49 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Michael S. Tsirkin, Vivek Goyal, Dr. David Alan Gilbert

On Wed, Jul 31, 2019 at 11:58:46AM +0200, Cornelia Huck wrote:
> On Fri, 26 Jul 2019 10:53:37 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > +\drivernormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
> > +
> > +The driver MUST NOT write to device configuration fields.
> > +
> > +The driver MAY use from one up to \field{num_queues} virtqueues.
> 
> s/virtqueues/request virtqueues/ ? I'm still a bit unsure about the
> requirements for queue usage, see below.

Will fix, thanks.

> 
> > +
> > +\devicenormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
> > +
> > +The device MUST set \field{num_queues} to 1 or greater.
> > +
> > +\subsection{Device Initialization}\label{Device Types / File System Device / Device Initialization}
> > +
> > +On initialization the driver first discovers the device's virtqueues.  The FUSE
> > +session is started by sending a FUSE\_INIT request as defined by the FUSE
> > +protocol on one request virtqueue.  All virtqueues provide access to the same
> > +FUSE session and therefore only one FUSE\_INIT request is required regardless
> > +of the number of available virtqueues.
> > +
> > +\subsection{Device Operation}\label{sec:Device Types / File System Device / Device Operation}
> > +
> > +Device operation consists of operating the virtqueues to facilitate file system
> > +access.
> > +
> > +The FUSE request types are as follows:
> > +\begin{itemize}
> > +\item Normal requests are made available by the driver and used by the device.
> 
> These go via the normal request queues...
> 
> > +\item Interrupt requests are made available by the driver to abort requests
> > +      that the device has not used yet.
> 
> ...and these via the hiprio queue?

Yes.

> 
> > +\end{itemize}
> > +
> > +Note that FUSE notification requests are not supported.
> > +
> > +\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Request Queues}
> > +
> > +The driver enqueues normal requests on an arbitrary request queue and they are
> > +completed by the device on that same queue.
> 
> Do we need a device normative statement that requests MUST be completed
> on the queue they have been submitted on?

That would be impossible since the request struct has both IN and OUT
elements.  virtio-fs is a request+response queue design like
virtio-blk/virtio-scsi, not a rx/tx queue design like
virtio-net/virtio-vsock.

> 
> > The device processes requests in
> > +any order.  The driver is responsible for ensuring that ordering constraints
> > +are met by making available a dependent request only after its prerequisite
> > +request has been used.
> > +
> > +Requests have the following format:
> 
> These can be either LE or BE from what is stated below, right? Maybe
> spell it out here already?

I will add "with the endianness chosen by the driver as detailed below".

> 
> > +
> > +\begin{lstlisting}
> > +struct virtio_fs_req {
> > +        // Device-readable part
> > +        struct fuse_in_header in;
> > +        u8 datain[];
> > +
> > +        // Device-writable part
> > +        struct fuse_out_header out;
> > +        u8 dataout[];
> > +};
> > +\end{lstlisting}
> > +
> > +Note that the words "in" and "out" follow the FUSE meaning and do not indicate
> > +the direction of data transfer under VIRTIO.  "In" means input to a request and
> > +"out" means output from processing a request.
> > +
> > +\field{in} is the common header for all types of FUSE requests.
> > +
> > +\field{datain} consists of request-specific data, if any.  This is identical to
> > +the data read from the /dev/fuse device by a FUSE daemon.
> > +
> > +\field{out} is the completion header common to all types of FUSE requests.
> > +
> > +\field{dataout} consists of request-specific data, if any.  This is identical
> > +to the data written to the /dev/fuse device by a FUSE daemon.
> > +
> > +For example, the full layout of a FUSE\_READ request is as follows:
> > +
> > +\begin{lstlisting}
> > +struct virtio_fs_read_req {
> > +        // Device-readable part
> > +        struct fuse_in_header in;
> > +        union {
> > +                struct fuse_read_in readin;
> > +                u8 datain[sizeof(struct fuse_read_in)];
> > +        };
> > +
> > +        // Device-writable part
> > +        struct fuse_out_header out;
> > +        u8 dataout[out.len - sizeof(struct fuse_out_header)];
> > +};
> > +\end{lstlisting}
> > +
> > +The FUSE protocol documented in \hyperref[intro:FUSE]{FUSE} specifies the set
> > +of request types and their contents.
> > +
> > +The endianness of the FUSE protocol session is detectable by inspecting the
> > +uint32\_t \field{in.opcode} field of the FUSE\_INIT request sent by the driver
> > +to the device.  This allows the device to determine whether the session is
> > +little-endian or big-endian.
> 
> Do we need a driver normative statement that the driver MUST NOT send a
> request with a different endianness once it established the session's
> endianness via FUSE_INIT?

It is valid to send another FUSE_INIT to switch to a fresh session.  In
that case the driver could change the endianness again.  I think we want
this just in case a weird system without fixed endianness has a
bootloader running in little-endian and a guest kernel running in
big-endian, for example.

I will document this in the next revision.

> 
> > +
> > +\subsubsection{Device Operation: High Priority Queue}\label{sec:Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
> > +
> > +The hiprio queue follows the same request format as the request queues.  This
> > +queue only contains FUSE\_INTERRUPT, FUSE\_FORGET, and FUSE\_BATCH\_FORGET
> > +requests.
> 
> Does this also imply that normal request queues can't be used for
> interrupt and forget requests? The normative statement below seems to
> imply that, but I think the text here does not make it clear already.

Yes, these requests must only be placed on the hiprio queue.  Will
clarify, thanks!

> > +
> > +Interrupt and forget requests have a higher priority than normal requests.  The
> > +separate hiprio queue is used for these requests to ensure they can be
> > +delivered even when all request queues are full.
> > +
> > +\devicenormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
> > +
> > +The device MUST NOT pause processing of the hiprio queue due to activity on a
> > +normal request queue.
> > +
> > +The device MAY process request queues concurrently with the hiprio queue.
> > +
> > +\drivernormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
> > +
> > +The driver MUST submit FUSE\_INTERRUPT, FUSE\_FORGET, and FUSE\_BATCH\_FORGET requests solely on the hiprio queue.
> 
> This specifies that the driver MUST submit interrupt and forget request
> via the hiprio queue, but it does not specify that no other requests
> may go via the hiprio queue, what the descriptive text above implies.

Will clarify.

Thank you for your review!

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 1/2] content: " Stefan Hajnoczi
  2019-07-31  9:58   ` Cornelia Huck
@ 2019-08-03 21:12   ` Michael S. Tsirkin
  2019-08-13  9:07     ` Stefan Hajnoczi
  2019-08-03 21:21   ` Michael S. Tsirkin
  2 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-08-03 21:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> +\subsubsection{Live migration considerations}\label{sec:Device Types / File System Device / Live Migration Considerations}
> +
> +When a guest is migrated to a new host

Please use driver and device here and elsewhere.

> it is necessary to consider the FUSE
> +session and its state.  The continuity of FUSE inode numbers (also known as
> +nodeids) and fh values is necessary so the driver can continue operation
> +without disruption.
> +
> +It is possible to maintain the FUSE session across live migration either by
> +transferring the state or by redirecting requests from the new host to the old
> +host where the state resides.  The details of how to achieve this are
> +implementation-dependent and are not visible at the device interface level.
> +
> +Maintaining version and feature information negotiated by FUSE\_INIT is
> +necessary so that no FUSE protocol feature changes are visible to the driver
> +across live migration.  The FUSE\_INIT information forms part of the FUSE
> +session state that needs to be transferred during live migration.

It bothers me that it's implicit, and is not exposed easily at the
virtio level, so one has to bind a virtqueue and run buffers over it to
even just check whether a given device supports a specific interface and
can be migrated to.

How about exposing the version of the device in the config space?
Spec can require that it matches the contents of the INIT command.

....

> -- 
> 2.21.0
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-07-26  9:53 ` [virtio-dev] [PATCH v5 1/2] content: " Stefan Hajnoczi
  2019-07-31  9:58   ` Cornelia Huck
  2019-08-03 21:12   ` Michael S. Tsirkin
@ 2019-08-03 21:21   ` Michael S. Tsirkin
  2019-08-13  9:35     ` Stefan Hajnoczi
  2 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-08-03 21:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> +\subsubsection{Security Considerations}\label{sec:Device Types / File System Device / Security Considerations}

One issue not addressed here at all is sharing the filesystem cache.
Is that a security consiteration at all?
A shared resource is often an information sidechannel.


> +
> +The device provides access to a file system containing files owned by one or
> +more POSIX user ids and group ids.  The device has no secure way of
> +differentiating between users originating requests via the driver.  Therefore
> +the device accepts the POSIX user ids and group ids provided by the driver and
> +security is enforced by the driver rather than the device.  It is nevertheless
> +possible for devices to implement POSIX user id and group id mapping or
> +whitelisting to control the ownership and access available to the driver.
> +
> +File systems containing special files including device nodes and setuid
> +executable files pose a security concern.  These properties are defined by the
> +file type and mode, which are set by the driver when creating new files or by
> +changes at a later time.  These special files present a security risk when the
> +file system is shared with another system, such as the host or another guest.

is this a risk to someone using the device, another system
with which this one is sharing, or both?

> +This issue can be solved on some operating systems using mount options that
> +ignore special files.

For this to be effective should not device provide a way for device
to report that the issue exists?

>  It is also possible for devices to implement
> +restrictions on special files by refusing their creation.

And then it can safely be mounted without above flags?
How would a device do this? The list above is not exhaustive.

> +
> +When the device provides shared access to a file system, symlink race
> +conditions, exhausting file system capacity, and overwriting or deleting files
> +used by others are factors to consider.  These issues have a long history in
> +multi-user operating systems and also apply to virtio-fs.  They are typically
> +managed at the file system administration level by providing shared access only
> +to mutually trusted users.

Can we be a bit more explicit here? I think this discusses
a device accessing a shared filesystem with some other
device. I am guessing it's the administrator of the filesystem, right?


-- 
MST


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-08-02 15:49     ` Stefan Hajnoczi
@ 2019-08-05 15:38       ` Cornelia Huck
  2019-08-05 17:03         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 18+ messages in thread
From: Cornelia Huck @ 2019-08-05 15:38 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Michael S. Tsirkin, Vivek Goyal, Dr. David Alan Gilbert

On Fri, 2 Aug 2019 16:49:51 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Wed, Jul 31, 2019 at 11:58:46AM +0200, Cornelia Huck wrote:
> > On Fri, 26 Jul 2019 10:53:37 +0100
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:  

> > > +\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Request Queues}
> > > +
> > > +The driver enqueues normal requests on an arbitrary request queue and they are
> > > +completed by the device on that same queue.  
> > 
> > Do we need a device normative statement that requests MUST be completed
> > on the queue they have been submitted on?  
> 
> That would be impossible since the request struct has both IN and OUT
> elements.  virtio-fs is a request+response queue design like
> virtio-blk/virtio-scsi, not a rx/tx queue design like
> virtio-net/virtio-vsock.

Hm, now I'm confused -- what is impossible here? That the requests are
completed somewhere else? (If so, we don't need an extra statement.)

> 
> >   
> > > The device processes requests in
> > > +any order.  The driver is responsible for ensuring that ordering constraints
> > > +are met by making available a dependent request only after its prerequisite
> > > +request has been used.
> > > +
> > > +Requests have the following format:  
> > 
> > These can be either LE or BE from what is stated below, right? Maybe
> > spell it out here already?  
> 
> I will add "with the endianness chosen by the driver as detailed below".

Sounds good.

> 
> >   
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_fs_req {
> > > +        // Device-readable part
> > > +        struct fuse_in_header in;
> > > +        u8 datain[];
> > > +
> > > +        // Device-writable part
> > > +        struct fuse_out_header out;
> > > +        u8 dataout[];
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +Note that the words "in" and "out" follow the FUSE meaning and do not indicate
> > > +the direction of data transfer under VIRTIO.  "In" means input to a request and
> > > +"out" means output from processing a request.
> > > +
> > > +\field{in} is the common header for all types of FUSE requests.
> > > +
> > > +\field{datain} consists of request-specific data, if any.  This is identical to
> > > +the data read from the /dev/fuse device by a FUSE daemon.
> > > +
> > > +\field{out} is the completion header common to all types of FUSE requests.
> > > +
> > > +\field{dataout} consists of request-specific data, if any.  This is identical
> > > +to the data written to the /dev/fuse device by a FUSE daemon.
> > > +
> > > +For example, the full layout of a FUSE\_READ request is as follows:
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_fs_read_req {
> > > +        // Device-readable part
> > > +        struct fuse_in_header in;
> > > +        union {
> > > +                struct fuse_read_in readin;
> > > +                u8 datain[sizeof(struct fuse_read_in)];
> > > +        };
> > > +
> > > +        // Device-writable part
> > > +        struct fuse_out_header out;
> > > +        u8 dataout[out.len - sizeof(struct fuse_out_header)];
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +The FUSE protocol documented in \hyperref[intro:FUSE]{FUSE} specifies the set
> > > +of request types and their contents.
> > > +
> > > +The endianness of the FUSE protocol session is detectable by inspecting the
> > > +uint32\_t \field{in.opcode} field of the FUSE\_INIT request sent by the driver
> > > +to the device.  This allows the device to determine whether the session is
> > > +little-endian or big-endian.  
> > 
> > Do we need a driver normative statement that the driver MUST NOT send a
> > request with a different endianness once it established the session's
> > endianness via FUSE_INIT?  
> 
> It is valid to send another FUSE_INIT to switch to a fresh session.  In
> that case the driver could change the endianness again.  I think we want
> this just in case a weird system without fixed endianness has a
> bootloader running in little-endian and a guest kernel running in
> big-endian, for example.

Yes, it's probably a good idea to spell that out.

> 
> I will document this in the next revision.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-08-05 15:38       ` Cornelia Huck
@ 2019-08-05 17:03         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-05 17:03 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Stefan Hajnoczi, virtio-dev, Sage Weil, Steven Whitehouse,
	Miklos Szeredi, Michael S. Tsirkin, Vivek Goyal

* Cornelia Huck (cohuck@redhat.com) wrote:
> On Fri, 2 Aug 2019 16:49:51 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Wed, Jul 31, 2019 at 11:58:46AM +0200, Cornelia Huck wrote:
> > > On Fri, 26 Jul 2019 10:53:37 +0100
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:  
> 
> > > > +\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Request Queues}
> > > > +
> > > > +The driver enqueues normal requests on an arbitrary request queue and they are
> > > > +completed by the device on that same queue.  
> > > 
> > > Do we need a device normative statement that requests MUST be completed
> > > on the queue they have been submitted on?  
> > 
> > That would be impossible since the request struct has both IN and OUT
> > elements.  virtio-fs is a request+response queue design like
> > virtio-blk/virtio-scsi, not a rx/tx queue design like
> > virtio-net/virtio-vsock.
> 
> Hm, now I'm confused -- what is impossible here? That the requests are
> completed somewhere else?

Correct - because the response goes back in elements in the same request
struct you can't split them.

Dave

> (If so, we don't need an extra statement.)

> 
> > 
> > >   
> > > > The device processes requests in
> > > > +any order.  The driver is responsible for ensuring that ordering constraints
> > > > +are met by making available a dependent request only after its prerequisite
> > > > +request has been used.
> > > > +
> > > > +Requests have the following format:  
> > > 
> > > These can be either LE or BE from what is stated below, right? Maybe
> > > spell it out here already?  
> > 
> > I will add "with the endianness chosen by the driver as detailed below".
> 
> Sounds good.
> 
> > 
> > >   
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_fs_req {
> > > > +        // Device-readable part
> > > > +        struct fuse_in_header in;
> > > > +        u8 datain[];
> > > > +
> > > > +        // Device-writable part
> > > > +        struct fuse_out_header out;
> > > > +        u8 dataout[];
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +Note that the words "in" and "out" follow the FUSE meaning and do not indicate
> > > > +the direction of data transfer under VIRTIO.  "In" means input to a request and
> > > > +"out" means output from processing a request.
> > > > +
> > > > +\field{in} is the common header for all types of FUSE requests.
> > > > +
> > > > +\field{datain} consists of request-specific data, if any.  This is identical to
> > > > +the data read from the /dev/fuse device by a FUSE daemon.
> > > > +
> > > > +\field{out} is the completion header common to all types of FUSE requests.
> > > > +
> > > > +\field{dataout} consists of request-specific data, if any.  This is identical
> > > > +to the data written to the /dev/fuse device by a FUSE daemon.
> > > > +
> > > > +For example, the full layout of a FUSE\_READ request is as follows:
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_fs_read_req {
> > > > +        // Device-readable part
> > > > +        struct fuse_in_header in;
> > > > +        union {
> > > > +                struct fuse_read_in readin;
> > > > +                u8 datain[sizeof(struct fuse_read_in)];
> > > > +        };
> > > > +
> > > > +        // Device-writable part
> > > > +        struct fuse_out_header out;
> > > > +        u8 dataout[out.len - sizeof(struct fuse_out_header)];
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +The FUSE protocol documented in \hyperref[intro:FUSE]{FUSE} specifies the set
> > > > +of request types and their contents.
> > > > +
> > > > +The endianness of the FUSE protocol session is detectable by inspecting the
> > > > +uint32\_t \field{in.opcode} field of the FUSE\_INIT request sent by the driver
> > > > +to the device.  This allows the device to determine whether the session is
> > > > +little-endian or big-endian.  
> > > 
> > > Do we need a driver normative statement that the driver MUST NOT send a
> > > request with a different endianness once it established the session's
> > > endianness via FUSE_INIT?  
> > 
> > It is valid to send another FUSE_INIT to switch to a fresh session.  In
> > that case the driver could change the endianness again.  I think we want
> > this just in case a weird system without fixed endianness has a
> > bootloader running in little-endian and a guest kernel running in
> > big-endian, for example.
> 
> Yes, it's probably a good idea to spell that out.
> 
> > 
> > I will document this in the next revision.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [virtio-dev] Re: [PATCH v5 2/2] virtio-fs: add DAX window
  2019-07-26 11:41   ` [virtio-dev] " Michael S. Tsirkin
@ 2019-08-12 13:55     ` Stefan Hajnoczi
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-08-12 13:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 912 bytes --]

On Fri, Jul 26, 2019 at 07:41:57AM -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2019 at 10:53:38AM +0100, Stefan Hajnoczi wrote:
> > Describe how shared memory region ID 0 is the DAX window and how
> > FUSE_SETUPMAPPING maps file ranges into the window.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> 
> Are there any security considerations with this memory sharing?
> If yes it might be a good idea to add a chapter about that,
> and functionality/performance/security tradeoffs involved.

I'll add statements about security.  In light of recent security attacks
we should mention timing side-channel attacks, although the general
security model is that each virtio-fs client has full access to the
files in the shared directory and therefore a certain level of trust is
already required since race conditions and data corruption are possible
anyway.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [virtio-dev] Re: [PATCH v5 0/2] virtio-fs: add virtio file system device
  2019-07-26 12:34   ` Dr. David Alan Gilbert
@ 2019-08-12 13:56     ` Stefan Hajnoczi
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-08-12 13:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michael S. Tsirkin, virtio-dev, Sage Weil, Steven Whitehouse,
	Miklos Szeredi, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 3598 bytes --]

On Fri, Jul 26, 2019 at 01:34:16PM +0100, Dr. David Alan Gilbert wrote:
> * Michael S. Tsirkin (mst@redhat.com) wrote:
> > Didn't look at the spec yet, but I have a question:
> > 
> > On Fri, Jul 26, 2019 at 10:53:36AM +0100, Stefan Hajnoczi wrote:
> > > v5:
> > >  * Explain multiqueue semantics: no ordering, identical functionality on each queue, one FUSE session state shared between all queues
> > >  * Explain how the FUSE session is started with a FUSE_INIT request
> > >  * Consistently use "submit" vs "made available" and "complete" vs "used" terminology [Michael]
> > >  * Explain endianness [Michael]
> > >  * Clarify hiprio vs normal queue usage [Michael]
> > >  * Move SHOULD, MUST, etc wording into normative sections [Michael]
> > >  * Mention that FUSE_INIT negotiated state needs to be transferred during live migration [Michael]
> > >  * State that the DAX window is mapped with writeback caching like RAM [Michael]
> > >  * Mention DAX window mapping alignment constraints (they are communicated via the FUSE protocol) [Michael]
> > >  * Explain that FUSE_SETUPMAPPING fails when device resources are exhausted and that splitting mappings consumes resources too [Michael]
> > >  * Clarify access rules to DAX window - only touch memory that has a mapping establised
> > >  * Document that DAX data persistence is achieved via FUSE_FSYNC
> > 
> > 
> > The following was raised during v4 review:
> > 
> > 	> > > > There is a practical problem that the QEMU process may hit the mmap
> > 	> > > > limit and be unable to perform its own mmaps due to the DAX Window.  A
> > 	> > > > limit must be enforced on the host so that QEMU's internal mmaps
> > 	> > > > succeed.
> > 	> > >
> > 	> > > But number of mmaps are already limited by dax window size which is
> > 	> > > controlled by virtiofsd. So all user has to do is start with smaller
> > 	> > > dax window size if this ever becomes a concern.
> > 	> >
> > 	> > Yes, the DAX Window size sets a hard limit on the number of mappings
> > 	> > (window_size / page_size).  I think we should still communicate a
> > 	> > maximum number of mappings to provide more control for cases where the
> > 	> > page size and DAX Window size don't produce a desirable number.
> > 	> >
> > 	> > Consider that window_size = 8 GB and page_size = 4 KB already yields
> > 	> > 2,097,152 maximum mappings.  Oops, that number is too large!
> > 	>
> > 	> Ok, so that's something which will be in virtiofsd where maximum number
> > 	> of outstanding can be configured by user and if we cross that limit,
> > 	> FUSE_SETUPMAPPING will be returned with some error?
> > 
> > 	Yes, virtiofsd will enforce the limit.
> > 
> > 	The guest driver will also be aware of the limit via VIRTIO
> > 	configuration space.
> > 
> > was this addressed?
> 
> In this version Stefan has marked SETUPMAPPING to say it can fail if out
> of resources.
> 
> The tricky problem with specifying a limit is that the limit is more
> likely to be global than to be per-device, and so specifying it in
> either the virtio config space or (as looked more appropriate) fuse init
> doesn't feel right.
> 
> IMHO we should never hit the limit in practical use, so it's really just
> a matter of saying that it can fail and setting a large enough limit
> that non-abusive drivers dont cause a problem.

Yep, I updated the spec to mention that FUSE_SETUPMAPPING fails when
there are insufficient resources.  Drivers expect that it can happen and
will have to unmap existing mappings if they wish to make new ones.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-08-03 21:12   ` Michael S. Tsirkin
@ 2019-08-13  9:07     ` Stefan Hajnoczi
  2019-08-13 11:24       ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-08-13  9:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 2783 bytes --]

On Sat, Aug 03, 2019 at 05:12:15PM -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> > +\subsubsection{Live migration considerations}\label{sec:Device Types / File System Device / Live Migration Considerations}
> > +
> > +When a guest is migrated to a new host
> 
> Please use driver and device here and elsewhere.

Will fix.

> > it is necessary to consider the FUSE
> > +session and its state.  The continuity of FUSE inode numbers (also known as
> > +nodeids) and fh values is necessary so the driver can continue operation
> > +without disruption.
> > +
> > +It is possible to maintain the FUSE session across live migration either by
> > +transferring the state or by redirecting requests from the new host to the old
> > +host where the state resides.  The details of how to achieve this are
> > +implementation-dependent and are not visible at the device interface level.
> > +
> > +Maintaining version and feature information negotiated by FUSE\_INIT is
> > +necessary so that no FUSE protocol feature changes are visible to the driver
> > +across live migration.  The FUSE\_INIT information forms part of the FUSE
> > +session state that needs to be transferred during live migration.
> 
> It bothers me that it's implicit, and is not exposed easily at the
> virtio level, so one has to bind a virtqueue and run buffers over it to
> even just check whether a given device supports a specific interface and
> can be migrated to.
> 
> How about exposing the version of the device in the config space?
> Spec can require that it matches the contents of the INIT command.

FUSE_INIT works like this:

1. The client sends FUSE_INIT with the supported highest protocol
   version.
2. If the server supports the client's version, it replies with the same
   version number and fills in the FUSE_INIT parameters.

   Otherwise the server lowers the version number and replies to the
   client, and the client will resend FUSE_INIT with the lower version
   number to acknowledge that it supports the lower version.  Go to step
   1.

With regards to live migration the server's FUSE_INIT reply structure
needs to be sent to the destination if a FUSE session has been
established.  If the destination does not support this protocol version
then live migration fails.

If management tools want to ensure that migration always succeeds then I
think the way to do that is not via virtio config space (they would
still need to start a guest in order to access it!).

I don't understand the benefit of exposing the device's version in the
config space?  Do you mean the negotiated version of the current FUSE
session or the highest FUSE protocol version supported by the device?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-08-03 21:21   ` Michael S. Tsirkin
@ 2019-08-13  9:35     ` Stefan Hajnoczi
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-08-13  9:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 3701 bytes --]

On Sat, Aug 03, 2019 at 05:21:19PM -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> > +\subsubsection{Security Considerations}\label{sec:Device Types / File System Device / Security Considerations}
> 
> One issue not addressed here at all is sharing the filesystem cache.
> Is that a security consiteration at all?
> A shared resource is often an information sidechannel.

Will add to the next revision.

> > +
> > +The device provides access to a file system containing files owned by one or
> > +more POSIX user ids and group ids.  The device has no secure way of
> > +differentiating between users originating requests via the driver.  Therefore
> > +the device accepts the POSIX user ids and group ids provided by the driver and
> > +security is enforced by the driver rather than the device.  It is nevertheless
> > +possible for devices to implement POSIX user id and group id mapping or
> > +whitelisting to control the ownership and access available to the driver.
> > +
> > +File systems containing special files including device nodes and setuid
> > +executable files pose a security concern.  These properties are defined by the
> > +file type and mode, which are set by the driver when creating new files or by
> > +changes at a later time.  These special files present a security risk when the
> > +file system is shared with another system, such as the host or another guest.
> 
> is this a risk to someone using the device, another system
> with which this one is sharing, or both?

Both.  Will add to the next revision.

> > +This issue can be solved on some operating systems using mount options that
> > +ignore special files.
> 
> For this to be effective should not device provide a way for device
> to report that the issue exists?

Mounting with nosuid,nodev disables these features and therefore
eliminates the security issue.  The device doesn't need to notify or
being involved directly.

> >  It is also possible for devices to implement
> > +restrictions on special files by refusing their creation.
> 
> And then it can safely be mounted without above flags?
> How would a device do this? The list above is not exhaustive.

The device would fail mknod and setting the setuid/setgid bits so that
clients cannot create these dangerous files.  The exact policy is
implementation-dependent, for example, the device may allow creation of
certain device major:minor nodes if the application depends on them.

Once these dangerous files cannot be created and if we trust that the
file system does not already contain them, then the nosuid,nodev mount
options are no longer necessary.

Since the details are implementation- and application-dependent I don't
want to go into detail in the spec.  The current virtiofsd implemenation
does not do any of this.

> > +
> > +When the device provides shared access to a file system, symlink race
> > +conditions, exhausting file system capacity, and overwriting or deleting files
> > +used by others are factors to consider.  These issues have a long history in
> > +multi-user operating systems and also apply to virtio-fs.  They are typically
> > +managed at the file system administration level by providing shared access only
> > +to mutually trusted users.
> 
> Can we be a bit more explicit here? I think this discusses
> a device accessing a shared filesystem with some other
> device. I am guessing it's the administrator of the filesystem, right?

Yes, by "shared" I meant the scenario where multiple systems are
accessing the same shared directory tree.  I'll reword it in the next
revision.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-08-13  9:07     ` Stefan Hajnoczi
@ 2019-08-13 11:24       ` Michael S. Tsirkin
  2019-08-13 15:13         ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-08-13 11:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

On Tue, Aug 13, 2019 at 10:07:11AM +0100, Stefan Hajnoczi wrote:
> On Sat, Aug 03, 2019 at 05:12:15PM -0400, Michael S. Tsirkin wrote:
> > On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> > > +\subsubsection{Live migration considerations}\label{sec:Device Types / File System Device / Live Migration Considerations}
> > > +
> > > +When a guest is migrated to a new host
> > 
> > Please use driver and device here and elsewhere.
> 
> Will fix.
> 
> > > it is necessary to consider the FUSE
> > > +session and its state.  The continuity of FUSE inode numbers (also known as
> > > +nodeids) and fh values is necessary so the driver can continue operation
> > > +without disruption.
> > > +
> > > +It is possible to maintain the FUSE session across live migration either by
> > > +transferring the state or by redirecting requests from the new host to the old
> > > +host where the state resides.  The details of how to achieve this are
> > > +implementation-dependent and are not visible at the device interface level.
> > > +
> > > +Maintaining version and feature information negotiated by FUSE\_INIT is
> > > +necessary so that no FUSE protocol feature changes are visible to the driver
> > > +across live migration.  The FUSE\_INIT information forms part of the FUSE
> > > +session state that needs to be transferred during live migration.
> > 
> > It bothers me that it's implicit, and is not exposed easily at the
> > virtio level, so one has to bind a virtqueue and run buffers over it to
> > even just check whether a given device supports a specific interface and
> > can be migrated to.
> > 
> > How about exposing the version of the device in the config space?
> > Spec can require that it matches the contents of the INIT command.
> 
> FUSE_INIT works like this:
> 
> 1. The client sends FUSE_INIT with the supported highest protocol
>    version.
> 2. If the server supports the client's version, it replies with the same
>    version number and fills in the FUSE_INIT parameters.
> 
>    Otherwise the server lowers the version number and replies to the
>    client, and the client will resend FUSE_INIT with the lower version
>    number to acknowledge that it supports the lower version.  Go to step
>    1.
> 
> With regards to live migration the server's FUSE_INIT reply structure
> needs to be sent to the destination if a FUSE session has been
> established.  If the destination does not support this protocol version
> then live migration fails.
> 
> If management tools want to ensure that migration always succeeds then I
> think the way to do that is not via virtio config space (they would
> still need to start a guest in order to access it!).
> 
> I don't understand the benefit of exposing the device's version in the
> config space?  Do you mean the negotiated version of the current FUSE
> session or the highest FUSE protocol version supported by the device?

The highest version.
The point is to make it crystal clear that for cross version migration
the highest version must be consistent.
Otherwise this is something implementers might easily overlook.

Another option is to add non-normative text explaining what must
be kept consistent to allow migration between devices.

Generally I think that it is unfortunate that pass-through
devices like this one are growing extensions for
feature negotiation outside the spec.
This hurts attempts at generic transport pass-through like vhost-user.

-- 
MST

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [virtio-dev] [PATCH v5 1/2] content: add virtio file system device
  2019-08-13 11:24       ` Michael S. Tsirkin
@ 2019-08-13 15:13         ` Stefan Hajnoczi
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2019-08-13 15:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Sage Weil, Steven Whitehouse, Miklos Szeredi,
	Vivek Goyal, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 4371 bytes --]

On Tue, Aug 13, 2019 at 07:24:45AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 13, 2019 at 10:07:11AM +0100, Stefan Hajnoczi wrote:
> > On Sat, Aug 03, 2019 at 05:12:15PM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Jul 26, 2019 at 10:53:37AM +0100, Stefan Hajnoczi wrote:
> > > > +\subsubsection{Live migration considerations}\label{sec:Device Types / File System Device / Live Migration Considerations}
> > > > +
> > > > +When a guest is migrated to a new host
> > > 
> > > Please use driver and device here and elsewhere.
> > 
> > Will fix.
> > 
> > > > it is necessary to consider the FUSE
> > > > +session and its state.  The continuity of FUSE inode numbers (also known as
> > > > +nodeids) and fh values is necessary so the driver can continue operation
> > > > +without disruption.
> > > > +
> > > > +It is possible to maintain the FUSE session across live migration either by
> > > > +transferring the state or by redirecting requests from the new host to the old
> > > > +host where the state resides.  The details of how to achieve this are
> > > > +implementation-dependent and are not visible at the device interface level.
> > > > +
> > > > +Maintaining version and feature information negotiated by FUSE\_INIT is
> > > > +necessary so that no FUSE protocol feature changes are visible to the driver
> > > > +across live migration.  The FUSE\_INIT information forms part of the FUSE
> > > > +session state that needs to be transferred during live migration.
> > > 
> > > It bothers me that it's implicit, and is not exposed easily at the
> > > virtio level, so one has to bind a virtqueue and run buffers over it to
> > > even just check whether a given device supports a specific interface and
> > > can be migrated to.
> > > 
> > > How about exposing the version of the device in the config space?
> > > Spec can require that it matches the contents of the INIT command.
> > 
> > FUSE_INIT works like this:
> > 
> > 1. The client sends FUSE_INIT with the supported highest protocol
> >    version.
> > 2. If the server supports the client's version, it replies with the same
> >    version number and fills in the FUSE_INIT parameters.
> > 
> >    Otherwise the server lowers the version number and replies to the
> >    client, and the client will resend FUSE_INIT with the lower version
> >    number to acknowledge that it supports the lower version.  Go to step
> >    1.
> > 
> > With regards to live migration the server's FUSE_INIT reply structure
> > needs to be sent to the destination if a FUSE session has been
> > established.  If the destination does not support this protocol version
> > then live migration fails.
> > 
> > If management tools want to ensure that migration always succeeds then I
> > think the way to do that is not via virtio config space (they would
> > still need to start a guest in order to access it!).
> > 
> > I don't understand the benefit of exposing the device's version in the
> > config space?  Do you mean the negotiated version of the current FUSE
> > session or the highest FUSE protocol version supported by the device?
> 
> The highest version.
> The point is to make it crystal clear that for cross version migration
> the highest version must be consistent.
> Otherwise this is something implementers might easily overlook.

For migration the current FUSE session's negotiated version must be
supported by the destination.  That's all.  The highest version
supported by the device doesn't matter.

> Another option is to add non-normative text explaining what must
> be kept consistent to allow migration between devices.
> 
> Generally I think that it is unfortunate that pass-through
> devices like this one are growing extensions for
> feature negotiation outside the spec.
> This hurts attempts at generic transport pass-through like vhost-user.

I'm not sure how we'll implement vhost-user live migration.  Currently
the vhost-user protocol doesn't allow the slave to provide migration
state.  But we'll need to do this in order to transfer all the state
associated with the virtio-fs device (FUSE_INIT struct, inodes, file
locks, etc).  It's important to recognize that this state goes way
beyond the FUSE_INIT struct, there is a lot of implementation-specific
state associated with a FUSE session!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-08-13 15:13 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-26  9:53 [virtio-dev] [PATCH v5 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
2019-07-26  9:53 ` [virtio-dev] [PATCH v5 1/2] content: " Stefan Hajnoczi
2019-07-31  9:58   ` Cornelia Huck
2019-08-02 15:49     ` Stefan Hajnoczi
2019-08-05 15:38       ` Cornelia Huck
2019-08-05 17:03         ` Dr. David Alan Gilbert
2019-08-03 21:12   ` Michael S. Tsirkin
2019-08-13  9:07     ` Stefan Hajnoczi
2019-08-13 11:24       ` Michael S. Tsirkin
2019-08-13 15:13         ` Stefan Hajnoczi
2019-08-03 21:21   ` Michael S. Tsirkin
2019-08-13  9:35     ` Stefan Hajnoczi
2019-07-26  9:53 ` [virtio-dev] [PATCH v5 2/2] virtio-fs: add DAX window Stefan Hajnoczi
2019-07-26 11:41   ` [virtio-dev] " Michael S. Tsirkin
2019-08-12 13:55     ` Stefan Hajnoczi
2019-07-26 12:27 ` [virtio-dev] Re: [PATCH v5 0/2] virtio-fs: add virtio file system device Michael S. Tsirkin
2019-07-26 12:34   ` Dr. David Alan Gilbert
2019-08-12 13:56     ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.