* [virtio-dev] [PATCH v2 0/2] virtio-fs: add virtio file system device
@ 2019-02-13  6:33 Stefan Hajnoczi
  2019-02-13  6:33 ` [virtio-dev] [PATCH v2 1/2] content: " Stefan Hajnoczi
  2019-02-13  6:33 ` [virtio-dev] [PATCH v2 2/2] virtio-fs: add DAX window Stefan Hajnoczi
  0 siblings, 2 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-02-13  6:33 UTC (permalink / raw)
  To: virtio-dev
  Cc: Miklos Szeredi, Vivek Goyal, Dr. David Alan Gilbert, Sage Weil,
	Steven Whitehouse, Stefan Hajnoczi

v2:
 * Clean up core virtio file system device spec
 * Add DAX window

These patches add the virtio file system device, which is based on Linux FUSE
but includes the DAX window extension.  Similar to virtio-scsi, which
transports SCSI commands, virtio-fs transports FUSE requests and the protocol
documentation is not duplicated here.

The DAX window allows file contents to be accessed directly from shared memory.
This eliminates copying of data, reduces the number of vmexits, and reduces the
guest's memory footprint.  It also allows coherent mmap MAP_SHARED semantics
between guests on the same host.

Michael Tsirkin has expressed an interest in security and live migration.  I
plan to add these sections in the next revision.

Please let me know which areas should be expanded or are missing.

Stefan Hajnoczi (2):
  content: add virtio file system device
  virtio-fs: add DAX window

 content.tex      |   3 +
 introduction.tex |   3 +
 virtio-fs.tex    | 223 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 229 insertions(+)
 create mode 100644 virtio-fs.tex

-- 
2.20.1


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


* [virtio-dev] [PATCH v2 1/2] content: add virtio file system device
  2019-02-13  6:33 [virtio-dev] [PATCH v2 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
@ 2019-02-13  6:33 ` Stefan Hajnoczi
  2019-02-13 16:47   ` Paolo Bonzini
  2019-02-13  6:33 ` [virtio-dev] [PATCH v2 2/2] virtio-fs: add DAX window Stefan Hajnoczi
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-02-13  6:33 UTC (permalink / raw)
  To: virtio-dev
  Cc: Miklos Szeredi, Vivek Goyal, Dr. David Alan Gilbert, Sage Weil,
	Steven Whitehouse, Stefan Hajnoczi

The virtio file system device transports Linux FUSE requests between a
FUSE daemon running on the host and the FUSE driver inside the guest.

The actual FUSE request definitions are not duplicated in the virtio
specification, similar to how virtio-scsi does not document SCSI
command details.  FUSE request definitions are available here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h

This patch documents the core virtio file system device, which is
functional but lacks the DAX feature introduced in the next patch.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 content.tex      |   3 +
 introduction.tex |   3 +
 virtio-fs.tex    | 198 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 204 insertions(+)
 create mode 100644 virtio-fs.tex

diff --git a/content.tex b/content.tex
index 836ee52..ac41fdb 100644
--- a/content.tex
+++ b/content.tex
@@ -2634,6 +2634,8 @@ Device ID  &  Virtio Device    \\
 \hline
 24         &   Memory device \\
 \hline
+26         &   file system device \\
+\hline
 \end{tabular}
 
 Some of the devices above are unspecified by this document,
@@ -5559,6 +5561,7 @@ descriptor for the \field{sense_len}, \field{residual},
 \input{virtio-input.tex}
 \input{virtio-crypto.tex}
 \input{virtio-vsock.tex}
+\input{virtio-fs.tex}
 
 \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 
diff --git a/introduction.tex b/introduction.tex
index a4ac01d..6eeda5d 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -60,6 +60,9 @@ Levels'', BCP 14, RFC 2119, March 1997. \newline\url{http://www.ietf.org/rfc/rfc
 	\phantomsection\label{intro:SCSI MMC}\textbf{[SCSI MMC]} &
         SCSI Multimedia Commands,
         \newline\url{http://www.t10.org/cgi-bin/ac.pl?t=f&f=mmc6r00.pdf}\\
+	\phantomsection\label{intro:FUSE}\textbf{[FUSE]} &
+	Linux FUSE interface,
+	\newline\url{https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h}\\
 
 \end{longtable}
 
diff --git a/virtio-fs.tex b/virtio-fs.tex
new file mode 100644
index 0000000..ffbaa46
--- /dev/null
+++ b/virtio-fs.tex
@@ -0,0 +1,198 @@
+\section{File System Device}\label{sec:Device Types / File System Device}
+
+The virtio file system device provides file system access.  The device may
+directly manage a file system or act as a gateway to a remote file system.  The
+details of how files are accessed are hidden by the device interface, allowing
+for a range of use cases.
+
+Unlike block-level storage devices such as virtio block and SCSI, the virtio
+file system device provides file-level access to data.  The device interface is
+based on the Linux Filesystem in Userspace (FUSE) protocol.  This consists of
+requests for file system traversal and for access to the files and directories within
+it.  The protocol details are defined by \hyperref[intro:FUSE]{FUSE}.
+
+The device acts as the FUSE file system daemon and the driver acts as the FUSE
+client mounting the file system.  The virtio file system device provides the
+mechanism for transporting FUSE requests, much like /dev/fuse in a traditional
+FUSE application.
+
+This section relies on definitions from \hyperref[intro:FUSE]{FUSE}.
+
+\subsection{Device ID}\label{sec:Device Types / File System Device / Device ID}
+  26
+
+\subsection{Virtqueues}\label{sec:Device Types / File System Device / Virtqueues}
+
+\begin{description}
+\item[0] notifications
+\item[1] hiprio
+\item[2\ldots n] request queues
+\end{description}
+
+\subsection{Feature bits}\label{sec:Device Types / File System Device / Feature bits}
+
+There are currently no feature bits defined.
+
+\subsection{Device configuration layout}\label{sec:Device Types / File System Device / Device configuration layout}
+
+All fields of this configuration are always available.
+
+\begin{lstlisting}
+struct virtio_fs_config {
+        char tag[36];
+        le32 num_queues;
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{tag}] is the name associated with this file system.  The tag is
+    encoded in UTF-8 and padded with NUL bytes if shorter than the
+    available space.  This field is not NUL-terminated if the encoded bytes
+    take up the entire field.
+\item[\field{num_queues}] is the total number of request virtqueues exposed by
+    the device. The driver MAY use only one request queue,
+    or it can use more to achieve better performance.
+\end{description}
+
+\drivernormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
+
+The driver MUST NOT write to device configuration fields.
+
+\devicenormative{\subsubsection}{Device configuration layout}{Device Types / File System Device / Device configuration layout}
+
+The device MUST set \field{num_queues} to 1 or greater.
+
+\drivernormative{\subsection}{Device Initialization}{Device Types / File System Device / Device Initialization}
+
+On initialization the driver MUST first discover the
+device's virtqueues.
+
+If the driver uses the notifications queue, the driver SHOULD place at least
+one buffer in the notifications queue before sending requests on other queues.
+
+\subsection{Device Operation}\label{sec:Device Types / File System Device / Device Operation}
+
+Device operation consists of operating the virtqueues to facilitate file system
+access.
+
+The FUSE request types are as follows:
+\begin{itemize}
+\item Normal requests are submitted by the driver and completed by the device.
+\item Interrupt requests are submitted by the driver to abort requests that the
+      device may have yet to complete.
+\item Notifications are submitted by the device and completed by the driver.
+\end{itemize}
+
+\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Request Queues}
+
+The driver enqueues normal requests on an arbitrary request queue and they are
+completed by the device on that same queue. It is the responsibility of the
+driver to ensure strict request ordering for commands placed on different
+queues, because they are consumed with no order constraints.
+
+Requests have the following format:
+
+\begin{lstlisting}
+struct virtio_fs_req {
+        // Device-readable part
+        struct fuse_in_header in;
+        u8 datain[];
+
+        // Device-writable part
+        struct fuse_out_header out;
+        u8 dataout[];
+};
+\end{lstlisting}
+
+Note that the words "in" and "out" follow the FUSE meaning and do not indicate
+the direction of data transfer under VIRTIO.  "In" means input to a request and
+"out" means output from processing a request.
+
+\field{in} is the common header for all types of FUSE requests.
+
+\field{datain} consists of request-specific data, if any.  This is identical to
+the data read from the /dev/fuse device by a FUSE daemon.
+
+\field{out} is the completion header common to all types of FUSE requests.
+
+\field{dataout} consists of request-specific data, if any.  This is identical
+to the data written to the /dev/fuse device by a FUSE daemon.
+
+For example, the full layout of a FUSE_READ request is as follows:
+
+\begin{lstlisting}
+struct virtio_fs_read_req {
+        // Device-readable part
+        struct fuse_in_header in;
+        union {
+                struct fuse_read_in readin;
+                u8 datain[sizeof(struct fuse_read_in)];
+        };
+
+        // Device-writable part
+        struct fuse_out_header out;
+        u8 dataout[out.len - sizeof(struct fuse_out_header)];
+};
+\end{lstlisting}
+
+The FUSE protocol documented in \hyperref[intro:FUSE]{FUSE} specifies the set
+of request types and their contents.  All request fields are little-endian.
+
+\subsubsection{Device Operation: High Priority Queue}\label{sec:Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The hiprio queue follows the same request format as the request queues.  This
+queue only contains FUSE_INTERRUPT, FUSE_FORGET, and FUSE_BATCH_FORGET
+requests.
+
+Interrupt and forget requests have a higher priority than normal requests.  In
+order to ensure that they can always be delivered, even if all request queues
+are full, a separate queue is used.
+
+\devicenormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The device SHOULD attempt to process the hiprio queue promptly.
+
+The device MAY process request queues concurrently with the hiprio queue.
+
+\drivernormative{\paragraph}{Device Operation: High Priority Queue}{Device Types / File System Device / Device Operation / Device Operation: High Priority Queue}
+
+The driver MUST submit FUSE_INTERRUPT, FUSE_FORGET, and FUSE_BATCH_FORGET requests solely on the hiprio queue.
+
+The driver MUST anticipate that request queues are processed concurrently with the hiprio queue.
+
+\subsubsection{Device Operation: Notifications Queue}\label{sec:Device Types / File System Device / Device Operation / Device Operation: Notifications Queue}
+
+The notifications queue is used for notification requests from the device to
+the driver.  The request queues cannot be used since they only work in the
+direction of the driver to the device.  Therefore the driver enqueues
+notifications ahead of time and the device completes them at the point in time
+when notifications are raised.
+
+Notifications are different from normal requests because they only contain
+device writable fields.  The driver sends notification replies on one of the
+request queues.  The format of notification requests is as follows:
+
+\begin{lstlisting}
+struct virtio_fs_notification_req {
+        // Device-writable part
+        struct fuse_out_header out;
+        u8 dataout[];
+};
+\end{lstlisting}
+
+\field{out} is the completion header common to all types of FUSE requests.  The
+\field{out.unique} field is 0 and the \field{out.error} field contains a
+FUSE_NOTIFY_* code.
+
+\field{dataout} consists of request-specific data, if any.  This is identical
+to the data written to the /dev/fuse device by a FUSE daemon.
+
+\devicenormative{\paragraph}{Device Operation: Notifications Queue}{Device Types / File System Device / Device Operation / Device Operation: Notifications Queue}
+
+The device MUST set \field{out.unique} to 0 and set \field{out.error} to a FUSE_NOTIFY_* code.
+
+\drivernormative{\paragraph}{Device Operation: Notifications Queue}{Device Types / File System Device / Device Operation / Device Operation: Notifications Queue}
+
+The driver MUST verify that \field{out.unique} is 0.
+
+Notifications queue buffers MUST be at least 8192 bytes long.
-- 
2.20.1
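
Editorial sketch (not part of the patch): the tag field in the device
configuration layout above is NUL-padded but not NUL-terminated when it fills
all 36 bytes, so a driver cannot treat it as a C string directly.  The helper
below is a minimal illustration under that reading.  The name
virtio_fs_copy_tag and the VIRTIO_FS_TAG_LEN macro are hypothetical, and le32
is modelled as a plain uint32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_FS_TAG_LEN 36    /* size of the tag field in the patch */

/* Mirrors the layout in the patch; le32 is modelled as a raw uint32_t. */
struct virtio_fs_config {
        char tag[VIRTIO_FS_TAG_LEN];
        uint32_t num_queues;    /* little-endian on the wire */
};

/*
 * Hypothetical helper: copy the tag into a NUL-terminated buffer.
 * out must have room for VIRTIO_FS_TAG_LEN + 1 bytes.
 * Returns the tag length in bytes, excluding the terminator.
 */
static size_t virtio_fs_copy_tag(const struct virtio_fs_config *cfg,
                                 char out[VIRTIO_FS_TAG_LEN + 1])
{
        const char *end;
        size_t len;

        assert(cfg != NULL && out != NULL);

        /* No terminator is present when all 36 bytes are used. */
        end = memchr(cfg->tag, '\0', VIRTIO_FS_TAG_LEN);
        len = end ? (size_t)(end - cfg->tag) : VIRTIO_FS_TAG_LEN;

        memcpy(out, cfg->tag, len);
        out[len] = '\0';
        return len;
}
```

Copying into a buffer one byte larger than the field keeps the full-length
case safe without truncating the tag.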

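Editorial sketch (not part of the patch): the queue rules above (queue 0 for
notifications, queue 1 for hiprio, queues 2..n for requests, with
FUSE_INTERRUPT, FUSE_FORGET, and FUSE_BATCH_FORGET going solely to hiprio)
could be expressed in a driver roughly as follows.  The opcode values are
taken from include/uapi/linux/fuse.h.  The function name, the VIRTIO_FS_VQ_*
names, and the "hint" parameter used to spread normal requests are assumptions
made for illustration; the patch leaves the choice of request queue to the
driver.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode values as defined in include/uapi/linux/fuse.h. */
enum {
        FUSE_FORGET       = 2,
        FUSE_INTERRUPT    = 36,
        FUSE_BATCH_FORGET = 42,
};

/* Virtqueue indices from the Virtqueues subsection of the patch. */
enum {
        VIRTIO_FS_VQ_NOTIFICATIONS = 0,
        VIRTIO_FS_VQ_HIPRIO        = 1,
        VIRTIO_FS_VQ_REQUEST_BASE  = 2,
};

/*
 * Hypothetical helper: choose the virtqueue index for a FUSE opcode.
 * num_queues is the config field (the device MUST set it to >= 1).
 * hint is any driver-chosen value, e.g. a CPU number, used to spread
 * normal requests over the request queues.
 */
static uint32_t virtio_fs_pick_queue(uint32_t opcode, uint32_t num_queues,
                                     uint32_t hint)
{
        assert(num_queues >= 1);

        switch (opcode) {
        case FUSE_FORGET:
        case FUSE_INTERRUPT:
        case FUSE_BATCH_FORGET:
                /* These requests are submitted solely on hiprio. */
                return VIRTIO_FS_VQ_HIPRIO;
        default:
                return VIRTIO_FS_VQ_REQUEST_BASE + hint % num_queues;
        }
}
```

Because requests on different queues are consumed with no ordering
constraints, a driver that needs ordering between two requests would pass the
same hint for both.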

* [virtio-dev] [PATCH v2 2/2] virtio-fs: add DAX window
  2019-02-13  6:33 [virtio-dev] [PATCH v2 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
  2019-02-13  6:33 ` [virtio-dev] [PATCH v2 1/2] content: " Stefan Hajnoczi
@ 2019-02-13  6:33 ` Stefan Hajnoczi
  2019-02-13 16:49   ` Paolo Bonzini
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-02-13  6:33 UTC (permalink / raw)
  To: virtio-dev
  Cc: Miklos Szeredi, Vivek Goyal, Dr. David Alan Gilbert, Sage Weil,
	Steven Whitehouse, Stefan Hajnoczi

Describe how shared memory region ID 0 is the DAX window and how
FUSE_SETUPMAPPING maps file ranges into the window.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 virtio-fs.tex | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/virtio-fs.tex b/virtio-fs.tex
index ffbaa46..37ab5ea 100644
--- a/virtio-fs.tex
+++ b/virtio-fs.tex
@@ -196,3 +196,28 @@ The device MUST set \field{out.unique} to 0 and set \field{out.error} to a FUSE_
 The driver MUST verify that \field{out.unique} is 0.
 
 Notifications queue buffers MUST be at least 8192 bytes long.
+
+\subsubsection{Device Operation: DAX Window}\label{sec:Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+FUSE\_READ and FUSE\_WRITE requests transfer file contents between the
+driver-provided buffer and the device.  In cases where data transfer is
+undesirable, the device can map file contents into the DAX window shared memory
+region.  The driver then accesses file contents directly in device-owned memory
+without a data transfer.
+
+Shared memory region ID 0 is called the DAX window.  The driver maps a file
+range into the DAX window using the FUSE\_SETUPMAPPING request.  The mapping is
+removed using the FUSE\_REMOVEMAPPING request.
+
+After FUSE\_SETUPMAPPING has completed successfully the file range is accessible
+from the DAX window at the offset provided by the driver in the request.
+
+\devicenormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+The device MUST allow mappings that completely or partially overlap existing mappings within the DAX window.
+
+The device MUST reject mappings that would go beyond the end of the DAX window.
+
+\drivernormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
+
+The driver SHOULD be prepared to find shared memory region ID 0 absent and fall back to FUSE\_READ and FUSE\_WRITE requests.
-- 
2.20.1

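Editorial sketch (not part of the patch): the device requirement to reject
mappings that would go beyond the end of the DAX window amounts to a range
check that must not overflow.  The function below is one way to write it.
The function and parameter names are hypothetical, and the field layout of
FUSE_SETUPMAPPING is not given in this thread, so the offset and length are
passed as plain integers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if the byte range
 * [offset, offset + len) lies entirely inside a DAX window of
 * window_size bytes.  Written to avoid computing offset + len,
 * which could wrap around for hostile inputs.
 */
static bool dax_mapping_in_window(uint64_t window_size, uint64_t offset,
                                  uint64_t len)
{
        if (offset > window_size)
                return false;
        return len <= window_size - offset;
}
```

Overlap with existing mappings is deliberately not checked here, since the
patch requires the device to allow overlapping mappings.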

* Re: [virtio-dev] [PATCH v2 1/2] content: add virtio file system device
  2019-02-13  6:33 ` [virtio-dev] [PATCH v2 1/2] content: " Stefan Hajnoczi
@ 2019-02-13 16:47   ` Paolo Bonzini
  2019-02-14  3:33     ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2019-02-13 16:47 UTC (permalink / raw)
  To: Stefan Hajnoczi, virtio-dev
  Cc: Miklos Szeredi, Vivek Goyal, Dr. David Alan Gilbert, Sage Weil,
	Steven Whitehouse

On 13/02/19 07:33, Stefan Hajnoczi wrote:
> +Notifications are different from normal requests because they only contain
> +device writable fields.  The driver sends notification replies on one of the
> +request queues.  The format of notification requests is as follows:
> +
> +\begin{lstlisting}
> +struct virtio_fs_notification_req {
> +        // Device-writable part
> +        struct fuse_out_header out;
> +        u8 dataout[];
> +};
> +\end{lstlisting}
> +
> +\field{out} is the completion header common to all types of FUSE requests.  The
> +\field{out.unique} field is 0 and the \field{out.error} field contains a
> +FUSE_NOTIFY_* code.
> +
> +\field{dataout} consists of request-specific data, if any.  This is identical
> +to the data written to the /dev/fuse device by a FUSE daemon.
> +

What happens if notifications are lost because no request was there?
virtio-scsi has a flag for that, would it make sense to add it here too?

Paolo

* Re: [virtio-dev] [PATCH v2 2/2] virtio-fs: add DAX window
  2019-02-13  6:33 ` [virtio-dev] [PATCH v2 2/2] virtio-fs: add DAX window Stefan Hajnoczi
@ 2019-02-13 16:49   ` Paolo Bonzini
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2019-02-13 16:49 UTC (permalink / raw)
  To: virtio-dev

On 13/02/19 07:33, Stefan Hajnoczi wrote:
> Describe how shared memory region ID 0 is the DAX window and how
> FUSE_SETUPMAPPING maps file ranges into the window.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  virtio-fs.tex | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/virtio-fs.tex b/virtio-fs.tex
> index ffbaa46..37ab5ea 100644
> --- a/virtio-fs.tex
> +++ b/virtio-fs.tex
> @@ -196,3 +196,28 @@ The device MUST set \field{out.unique} to 0 and set \field{out.error} to a FUSE_
>  The driver MUST verify that \field{out.unique} is 0.
>  
>  Notifications queue buffers MUST be at least 8192 bytes long.
> +
> +\subsubsection{Device Operation: DAX Window}\label{sec:Device Types / File System Device / Device Operation / Device Operation: DAX Window}
> +
> +FUSE\_READ and FUSE\_WRITE requests transfer file contents between the
> +driver-provided buffer and the device.  In cases where data transfer is
> +undesirable, the device can map file contents into the DAX window shared memory
> +region.  The driver then accesses file contents directly in device-owned memory
> +without a data transfer.
> +
> +Shared memory region ID 0 is called the DAX window.  The driver maps a file
> +range into the DAX window using the FUSE\_SETUPMAPPING request.  The mapping is
> +removed using the FUSE\_REMOVEMAPPING request.
> +
> +After FUSE\_SETUPMAPPING has completed successfully the file range is accessible
> +from the DAX window at the offset provided by the driver in the request.

FYI: this is not upstream yet, see https://lkml.org/lkml/2018/12/10/573
for a definition.

Paolo

> +\devicenormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
> +
> +The device MUST allow mappings that completely or partially overlap existing mappings within the DAX window.
> +
> +The device MUST reject mappings that would go beyond the end of the DAX window.
> +
> +\drivernormative{\paragraph}{Device Operation: DAX Window}{Device Types / File System Device / Device Operation / Device Operation: DAX Window}
> +
> +The driver SHOULD be prepared to find shared memory region ID 0 absent and fall back to FUSE\_READ and FUSE\_WRITE requests.
> 


* Re: [virtio-dev] [PATCH v2 1/2] content: add virtio file system device
  2019-02-13 16:47   ` Paolo Bonzini
@ 2019-02-14  3:33     ` Stefan Hajnoczi
       [not found]       ` <CAOssrKdfYic5bSz-GkfZUbd=OrADwrVXPRTQiVgOFbzpiVwNZg@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-02-14  3:33 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: virtio-dev, Vivek Goyal, Dr. David Alan Gilbert, Sage Weil,
	Steven Whitehouse, Paolo Bonzini

On Wed, Feb 13, 2019 at 05:47:19PM +0100, Paolo Bonzini wrote:
> On 13/02/19 07:33, Stefan Hajnoczi wrote:
> > +Notifications are different from normal requests because they only contain
> > +device writable fields.  The driver sends notification replies on one of the
> > +request queues.  The format of notification requests is as follows:
> > +
> > +\begin{lstlisting}
> > +struct virtio_fs_notification_req {
> > +        // Device-writable part
> > +        struct fuse_out_header out;
> > +        u8 dataout[];
> > +};
> > +\end{lstlisting}
> > +
> > +\field{out} is the completion header common to all types of FUSE requests.  The
> > +\field{out.unique} field is 0 and the \field{out.error} field contains a
> > +FUSE_NOTIFY_* code.
> > +
> > +\field{dataout} consists of request-specific data, if any.  This is identical
> > +to the data written to the /dev/fuse device by a FUSE daemon.
> > +
> 
> What happens if notifications are lost because no request was there?
> virtio-scsi has a flag for that, would it make sense to add it here too?

The FUSE protocol assumes notification delivery is reliable.  Some
notifications can be dropped with no or little impact on functionality,
but others cannot because it would cause a hung operation.

Therefore the device must hold notifications until the driver makes
buffers available.  The question becomes what happens when the device
runs out of space to hold notifications.  At this point the device must
be reset because no further progress is possible.
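
As an illustration only (this is not code from any implementation), the
holding behaviour could look like a bounded FIFO on the device side:
notifications are queued while the driver has no buffers posted, and a failed
push is the "out of space" condition at which the device would have to be
reset.  The struct, the function names, and the capacity of 4 are all made up
for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_MAX 4   /* illustrative capacity, not from the spec */

/* Hypothetical device-side FIFO of undelivered notification codes. */
struct pending_notifications {
        uint32_t codes[PENDING_MAX];
        size_t head;    /* index of the oldest entry */
        size_t count;   /* number of queued entries */
};

/*
 * Queue a notification while no driver buffer is available.
 * Returns false when storage is exhausted; per the discussion above,
 * that is the point where the device can make no further progress.
 */
static bool pending_push(struct pending_notifications *p, uint32_t code)
{
        if (p->count == PENDING_MAX)
                return false;
        p->codes[(p->head + p->count) % PENDING_MAX] = code;
        p->count++;
        return true;
}

/*
 * Called when the driver makes a notification buffer available.
 * Returns false if nothing is waiting to be delivered.
 */
static bool pending_pop(struct pending_notifications *p, uint32_t *code)
{
        if (p->count == 0)
                return false;
        *code = p->codes[p->head];
        p->head = (p->head + 1) % PENDING_MAX;
        p->count--;
        return true;
}
```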

(In our current implementation notifications aren't used at all, but the
virtio-fs spec should allow for it so that the full FUSE protocol is
available for future applications.)

Miklos: What do you think from the FUSE protocol point of view?

Stefan

* Re: [virtio-dev] [PATCH v2 1/2] content: add virtio file system device
       [not found]       ` <CAOssrKdfYic5bSz-GkfZUbd=OrADwrVXPRTQiVgOFbzpiVwNZg@mail.gmail.com>
@ 2019-02-14  8:36         ` Stefan Hajnoczi
       [not found]           ` <CAOssrKcXftoaszAoNF3936rTP0KgD6mL4P=hoy54v2i+x-1o8A@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-02-14  8:36 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: virtio-dev, Vivek Goyal, Dr. David Alan Gilbert, Sage Weil,
	Steven Whitehouse, Paolo Bonzini

On Thu, Feb 14, 2019 at 08:20:58AM +0100, Miklos Szeredi wrote:
> On Thu, Feb 14, 2019 at 4:33 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Wed, Feb 13, 2019 at 05:47:19PM +0100, Paolo Bonzini wrote:
> > > On 13/02/19 07:33, Stefan Hajnoczi wrote:
> > > > +Notifications are different from normal requests because they only contain
> > > > +device writable fields.  The driver sends notification replies on one of the
> > > > +request queues.  The format of notification requests is as follows:
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_fs_notification_req {
> > > > +        // Device-writable part
> > > > +        struct fuse_out_header out;
> > > > +        u8 dataout[];
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +\field{out} is the completion header common to all types of FUSE requests.  The
> > > > +\field{out.unique} field is 0 and the \field{out.error} field contains a
> > > > +FUSE_NOTIFY_* code.
> > > > +
> > > > +\field{dataout} consists of request-specific data, if any.  This is identical
> > > > +to the data written to the /dev/fuse device by a FUSE daemon.
> > > > +
> > >
> > > What happens if notifications are lost because no request was there?
> > > virtio-scsi has a flag for that, would it make sense to add it here too?
> >
> > The FUSE protocol assumes notification delivery is reliable.  Some
> > notifications can be dropped with no or little impact on functionality,
> > but others cannot because it would cause a hung operation.
> >
> > Therefore the device must hold notifications until the driver makes
> > buffers available.  The question becomes what happens when the device
> > runs out of space to hold notifications.  At this point the device must
> > be reset because no further progress is possible.
> >
> > (In our current implementation notifications aren't used at all, but the
> > virtio-fs spec should allow for it so that the full FUSE protocol is
> > available for future applications.)
> >
> > Miklos: What do you think from the FUSE protocol point of view?
> 
> I think notifications would be limited in functionality in the virtio
> case.  E.g. FUSE_NOTIFY_INVAL_INODE could be used if the server
> detects that cached attributes have become invalid.  If this is a best
> effort thing (doesn't block other clients) then it's okay.  But if
> it's for implementing strong coherency, then it doesn't work that
> well, since the broken client can block other clients from making
> progress.
>
> So I'm not sure.  Probably easier to leave notifications out of the
> implementation and the spec, until an actual use case arises.

I'd still like to discuss the options because it would be a real problem
to spec the device without notifications and then find out the design
cannot be extended when we need them.

When the device runs out of space to queue notifications for a slow
client, that client must be kicked out so that others can continue.
This seems like the most robust way to keep the file system available.
Only the client that couldn't keep up is hurt.

In our implementation each virtiofsd has a single client, so I'm not
sure the denial-of-service you described can occur (is that something
that involves the ireg daemon?).  The good thing is this means that
virtiofsd may block until its sole client replenishes notifications
buffers.  No other clients are hurt!

Stefan

* Re: [virtio-dev] [PATCH v2 1/2] content: add virtio file system device
       [not found]           ` <CAOssrKcXftoaszAoNF3936rTP0KgD6mL4P=hoy54v2i+x-1o8A@mail.gmail.com>
@ 2019-02-18 10:20             ` Stefan Hajnoczi
       [not found]               ` <CAOssrKeZhTMbKET=MRjDpqmWNMj00b2MG4Gv6MY-kdKYf1=eyA@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-02-18 10:20 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: virtio-dev, Vivek Goyal, Dr. David Alan Gilbert, Sage Weil,
	Steven Whitehouse, Paolo Bonzini

On Thu, Feb 14, 2019 at 09:50:23AM +0100, Miklos Szeredi wrote:
> On Thu, Feb 14, 2019 at 9:36 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Feb 14, 2019 at 08:20:58AM +0100, Miklos Szeredi wrote:
> > > On Thu, Feb 14, 2019 at 4:33 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Wed, Feb 13, 2019 at 05:47:19PM +0100, Paolo Bonzini wrote:
> > > > > On 13/02/19 07:33, Stefan Hajnoczi wrote:
> > > > > > +Notifications are different from normal requests because they only contain
> > > > > > +device writable fields.  The driver sends notification replies on one of the
> > > > > > +request queues.  The format of notification requests is as follows:
> > > > > > +
> > > > > > +\begin{lstlisting}
> > > > > > +struct virtio_fs_notification_req {
> > > > > > +        // Device-writable part
> > > > > > +        struct fuse_out_header out;
> > > > > > +        u8 dataout[];
> > > > > > +};
> > > > > > +\end{lstlisting}
> > > > > > +
> > > > > > +\field{out} is the completion header common to all types of FUSE requests.  The
> > > > > > +\field{out.unique} field is 0 and the \field{out.error} field contains a
> > > > > > +FUSE_NOTIFY_* code.
> > > > > > +
> > > > > > +\field{dataout} consists of request-specific data, if any.  This is identical
> > > > > > +to the data written to the /dev/fuse device by a FUSE daemon.
> > > > > > +
> > > > >
> > > > > What happens if notifications are lost because no request was there?
> > > > > virtio-scsi has a flag for that, would it make sense to add it here too?
> > > >
> > > > The FUSE protocol assumes notification delivery is reliable.  Some
> > > > notifications can be dropped with no or little impact on functionality,
> > > > but others cannot because it would cause a hung operation.
> > > >
> > > > Therefore the device must hold notifications until the driver makes
> > > > buffers available.  The question becomes what happens when the device
> > > > runs out of space to hold notifications.  At this point the device must
> > > > be reset because no further progress is possible.
> > > >
> > > > (In our current implementation notifications aren't used at all, but the
> > > > virtio-fs spec should allow for it so that the full FUSE protocol is
> > > > available for future applications.)
> > > >
> > > > Miklos: What do you think from the FUSE protocol point of view?
> > >
> > > I think notifications would be limited in functionality in the virtio
> > > case.  E.g. FUSE_NOTIFY_INVAL_INODE could be used if the server
> > > detects that cached attributes have become invalid.  If this is a best
> > > effort thing (doesn't block other clients) then it's okay.  But if
> > > it's for implementing strong coherency, then it doesn't work that
> > > well, since the broken client can block other clients from making
> > > progress.
> > >
> > > So I'm not sure.  Probably easier to leave notifications out of the
> > > implementation and the spec, until an actual use case arises.
> >
> > I'd still like to discuss the options because it would be a real problem
> > to spec the device without notifications and then find out the design
> > cannot be extended when we need them.
> >
> > When the device runs out of space to queue notifications for a slow
> > client, that client must be kicked out so that others can continue.
> > This seems like the most robust way to keep the file system available.
> > Only the client that couldn't keep up is hurt.
> >
> > In our implementation each virtiofsd has a single client, so I'm not
> > sure the denial-of-service you described can occur (is that something
> > that involves the ireg daemon?).
> 
> The DoS would involve any mechanism synchronizing one client's
> metadata modification with another client's metadata retrieval.  In
> our implementation it is hoped to bypass that synchronization by using
> the version number living in shared memory.
> 
> > The good thing is this means that
> > virtiofsd may block until its sole client replenishes notification
> > buffers.  No other clients are hurt!
> 
> Right, so the notification in itself is not a source of DoS, but for
> it to be useful, it would necessarily involve some higher level
> synchronization, no?

I think this depends on the file system.  In some cases notifications
require resource allocation and that leads to the problems we've been
discussing.  In other cases the file system daemon may be able to send
notifications at a later point in time without resource exhaustion, so
no flow control or synchronization is needed.

Here are my thoughts on each FUSE notification type:

FUSE_NOTIFY_POLL
Requires guaranteed delivery: Yes, to wake up a sleeping task
Possibility of resource exhaustion: Yes.  If other clients generate poll
notifications via their activity, then the notification buffers can be
exhausted.

When is FUSE_NOTIFY_POLL used?  I guess it's needed in cases where the
file system is changed on the remote side.  virtiofsd should implement
this notification eventually so that virtio-fs clients notice changes.
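
The exhaustion case for a guaranteed-delivery notification such as
FUSE_NOTIFY_POLL can be sketched roughly as follows.  This is only an
illustration of the "kick the slow client" policy discussed earlier in
the thread, not virtiofsd code: the NotificationQueue class and its
methods are hypothetical names, and the only value taken from the FUSE
protocol is the notify code for FUSE_NOTIFY_POLL.

```python
from collections import deque

FUSE_NOTIFY_POLL = 1  # notify code from the FUSE protocol header


class NotificationQueue:
    """Hypothetical per-client queue bounded by client-supplied buffers."""

    def __init__(self, client_buffers):
        # Buffers the client has made available for notifications.
        self.free_buffers = client_buffers
        self.delivered = deque()
        # Set once delivery can no longer be guaranteed for this client.
        self.reset_required = False

    def replenish(self, count):
        """The client hands back `count` buffers after processing them."""
        self.free_buffers += count

    def notify(self, code):
        """Deliver one notification; return False if the client is kicked."""
        if self.reset_required:
            return False
        if self.free_buffers == 0:
            # No buffer left: only this slow client is affected.
            self.reset_required = True
            return False
        self.free_buffers -= 1
        self.delivered.append(code)
        return True


# A client that posted two buffers and never replenishes them.
q = NotificationQueue(client_buffers=2)
results = [q.notify(FUSE_NOTIFY_POLL) for _ in range(3)]
```

With one queue per client, the third notification fails and only that
client needs a reset; a client that calls replenish() in time never
reaches that state.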

FUSE_NOTIFY_INVAL_INODE
Alternative: Use version shared memory region instead*

FUSE_NOTIFY_INVAL_ENTRY
Alternative: Use version shared memory region instead*

FUSE_NOTIFY_STORE
Alternative: Use DAX instead

FUSE_NOTIFY_RETRIEVE
Alternative: Use DAX instead

FUSE_NOTIFY_DELETE
Alternative: Use version shared memory region instead*
             (Does it handle deleted inodes?)

* The version shared memory region prevents clients from using stale
  data (good) but doesn't prompt the client that something has changed
  (bad).  This means inotify or similar cannot rely on these
  notifications to detect changes if we use the version shared memory
  instead.
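
The trade-off in the footnote above can be illustrated with a small
sketch.  It assumes a shared memory region that holds one version
counter per inode; the VersionTable and ClientCache names and the
dictionary-based layout are hypothetical stand-ins, not the actual
region format.

```python
class VersionTable:
    """Hypothetical stand-in for the version shared memory region."""

    def __init__(self):
        self.versions = {}  # inode number -> version counter

    def bump(self, inode):
        # Server side: called whenever the inode's metadata changes.
        self.versions[inode] = self.versions.get(inode, 0) + 1

    def read(self, inode):
        return self.versions.get(inode, 0)


class ClientCache:
    """Client-side attribute cache validated against the version table."""

    def __init__(self, table):
        self.table = table
        self.cached = {}  # inode number -> (version seen, attributes)

    def store(self, inode, attrs):
        self.cached[inode] = (self.table.read(inode), attrs)

    def lookup(self, inode):
        entry = self.cached.get(inode)
        if entry is None:
            return None
        version, attrs = entry
        if version != self.table.read(inode):
            # Stale: drop the entry so the caller refetches from the server.
            del self.cached[inode]
            return None
        return attrs


table = VersionTable()
cache = ClientCache(table)
cache.store(42, {"size": 100})
first = cache.lookup(42)   # version unchanged, cached attributes are used
table.bump(42)             # another client modifies the inode
second = cache.lookup(42)  # mismatch is only noticed on this access
```

The client never uses stale attributes, but nothing runs on the client
when bump() happens, which is why inotify-style change detection cannot
be built on this mechanism alone.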

Stefan

* Re: [virtio-dev] [PATCH v2 1/2] content: add virtio file system device
       [not found]               ` <CAOssrKeZhTMbKET=MRjDpqmWNMj00b2MG4Gv6MY-kdKYf1=eyA@mail.gmail.com>
@ 2019-02-18 14:44                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-02-18 14:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: virtio-dev, Vivek Goyal, Dr. David Alan Gilbert, Sage Weil,
	Steven Whitehouse, Paolo Bonzini

On Mon, Feb 18, 2019 at 11:52:34AM +0100, Miklos Szeredi wrote:
> Indeed, inotify and friends would need notification support, but at
> the moment no network filesystem has that and the kernel doesn't have
> interfaces for filesystems to provide remote notifications, so this is
> quite theoretical at this point.

This discussion makes me more confident that virtio-fs doesn't need
notifications yet.

I will remove them from the spec.

Stefan

end of thread, other threads:[~2019-02-18 14:45 UTC | newest]

Thread overview: 9+ messages
2019-02-13  6:33 [virtio-dev] [PATCH v2 0/2] virtio-fs: add virtio file system device Stefan Hajnoczi
2019-02-13  6:33 ` [virtio-dev] [PATCH v2 1/2] content: " Stefan Hajnoczi
2019-02-13 16:47   ` Paolo Bonzini
2019-02-14  3:33     ` Stefan Hajnoczi
     [not found]       ` <CAOssrKdfYic5bSz-GkfZUbd=OrADwrVXPRTQiVgOFbzpiVwNZg@mail.gmail.com>
2019-02-14  8:36         ` Stefan Hajnoczi
     [not found]           ` <CAOssrKcXftoaszAoNF3936rTP0KgD6mL4P=hoy54v2i+x-1o8A@mail.gmail.com>
2019-02-18 10:20             ` Stefan Hajnoczi
     [not found]               ` <CAOssrKeZhTMbKET=MRjDpqmWNMj00b2MG4Gv6MY-kdKYf1=eyA@mail.gmail.com>
2019-02-18 14:44                 ` Stefan Hajnoczi
2019-02-13  6:33 ` [virtio-dev] [PATCH v2 2/2] virtio-fs: add DAX window Stefan Hajnoczi
2019-02-13 16:49   ` Paolo Bonzini
