All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6] fuse: Add support for passthrough read/write
@ 2020-08-12 16:14 Alessio Balsini
  2020-08-12 18:29 ` Jann Horn
  2020-08-12 22:06 ` kernel test robot
  0 siblings, 2 replies; 13+ messages in thread
From: Alessio Balsini @ 2020-08-12 16:14 UTC (permalink / raw)
  To: Miklos Szeredi, Nikhilesh Reddy
  Cc: Akilesh Kailash, David Anderson, Eric Yan, Jann Horn,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, kernel-team, linux-fsdevel, linux-kernel

Add support for filesystem passthrough read/write of files when enabled in
userspace through the option FUSE_PASSTHROUGH.

There are filesystems based on FUSE that are intended to enforce special
policies or trigger complicate decision makings at the file operations
level.  Android, for example, uses FUSE to enforce fine-grained access
policies that also depend on the file contents.
Sometimes it happens that at open or create time a file is identified as
not requiring additional checks for consequent reads/writes, thus FUSE
would simply act as a passive bridge between the process accessing the FUSE
filesystem and the lower filesystem. Splicing and caching help reducing the
FUSE overhead, but there are still read/write operations forwarded to the
userspace FUSE daemon that could be avoided.

When the FUSE_PASSTHROUGH capability is enabled the FUSE daemon may decide
while handling the open or create operations if the given file can be
accessed in passthrough mode, meaning that all the further read and write
operations would be forwarded by the kernel directly to the lower
filesystem rather than to the FUSE daemon. All requests that are not reads
or writes are still handled by the userspace FUSE daemon.
This allows for improved performance on reads and writes. Benchmarks show
improved performance that is close to native filesystem access when doing
massive manipulations on a single opened file, especially in the case of
random reads, for which the bandwidth increased by almost 2X or sequential
writes for which the improvement is close to 3X.

The creation of this direct connection (passthrough) between FUSE file
objects and file objects in the lower filesystem happens in a way that
reminds of passing file descriptors via sockets:
- a process opens a file handled by FUSE, so the kernel forwards the
  request to the FUSE daemon;
- the FUSE daemon opens the target file in the lower filesystem, getting
  its file descriptor;
- the file descriptor is passed to the kernel via /dev/fuse;
- the kernel gets the file pointer navigating through the opened files of
  the "current" process and stores it in an additional field in the
  fuse_file owned by the process accessing the FUSE filesystem.
From now all the read/write operations performed by that process will be
redirected to the file pointer pointing at the lower filesystem's file.

Signed-off-by: Nikhilesh Reddy <reddyn@codeaurora.org>
Signed-off-by: Alessio Balsini <balsini@android.com>
--

    Performance

What follows has been performed with this change rebased on top of a
vanilla v5.8 Linux kernel, using a custom passthrough_hp FUSE daemon that
enables pass-through for each file that is opened during both “open” and
“create”. Tests were run on an Intel Xeon E5-2678V3, 32GiB of RAM, with an
ext4-formatted SSD as the lower filesystem, with no special tuning, e.g.,
all the involved processes are SCHED_OTHER, ondemand is the frequency
governor with no frequency restrictions, and turbo-boost, as well as
p-state, are active. This is because I noticed that, for such high-level
benchmarks, results consistency was minimally affected by these features.

The source code of the updated libfuse library and passthrough_hp is shared
at the following repository:

    https://github.com/balsini/libfuse/tree/fuse-passthrough-stable-v.3.9.4

Two different kinds of benchmarks were done for this change, the first set
of tests evaluates the bandwidth improvements when manipulating a huge
single file, the second set of tests verify that no performance regressions
were introduced when handling many small files.

The first benchmarks were done by running FIO (fio-3.21) with:
- bs=4Ki;
- file size: 50Gi;
- ioengine: sync;
- fsync_on_close: true.
The target file has been chosen large enough to avoid it to be entirely
loaded into the page cache.
Results are presented in the following table:

+-----------+--------+-------------+--------+
| Bandwidth |  FUSE  |     FUSE    |  Bind  |
|  (KiB/s)  |        | passthrough |  mount |
+-----------+--------+-------------+--------+
| read      | 468897 |      502085 | 516830 |
+-----------+--------+-------------+--------+
| randread  |  15773 |       26632 |  21386 |
+-----------+--------+-------------+--------+
| write     |  58185 |      141272 | 141671 |
+-----------+--------+-------------+--------+
| randwrite |  59892 |       75236 |  76486 |
+-----------+--------+-------------+--------+

As long as this patch has the primary objective of improving bandwidth,
another set of tests has been performed to see how this behaves on a
totally different scenario that involves accessing many small files. For
this purpose, measuring the build time of the Linux kernel has been chosen
as a well-known workload.
The kernel has been built with as many processes as the number of logical
CPUs (-j $(nproc)), that besides being a reasonable number, is also enough
to saturate the processor’s utilization thanks to the additional FUSE
daemon’s threads, making it even harder to get closer to the native
filesystem performance.
The following table shows the total build times in the different
configurations:

+------------------+--------------+-----------+
|                  | AVG duration |  Standard |
|                  |     (sec)    | deviation |
+------------------+--------------+-----------+
| FUSE             |      144.566 |     0.697 |
+------------------+--------------+-----------+
| FUSE passthrough |      133.820 |     0.341 |
+------------------+--------------+-----------+
| Raw              |      109.423 |     0.724 |
+------------------+--------------+-----------+

Further testing and performance evaluations are welcome.

Changes in v6:
* Port to kernel v5.8:
  * fuse_file_{read,write}_iter() changed since the v5 of this patch was
    proposed.
* Simplify fuse_simple_request().
* Merge fuse_passthrough.h into fuse_i.h
* Refactor of passthrough.c:
  * Remove BUG_ON()s.
  * Simplified error checking and request arguments indexing.
  * Use call_{read,write}_iter() utility functions.
  * Remove get_file() and fputs() during read/write: handle the extra FUSE
    references to the lower file object when the fuse_file is
    created/deleted.
  [Proposed by Jann Horn]

Changes in v5:
* Fix the check when setting the passthrough file.
  [Found when testing by Mike Shal]

Changes in v3 and v4:
* Use the fs_stack_depth to prevent further stacking and a minor fix.
  [Proposed by Jann Horn]

Changes in v2:
* Changed the feature name to passthrough from stacked_io.
  [Proposed by Linus Torvalds]
---
 fs/fuse/Makefile          |   1 +
 fs/fuse/dev.c             |   3 ++
 fs/fuse/dir.c             |   2 +
 fs/fuse/file.c            |  25 ++++++---
 fs/fuse/fuse_i.h          |  18 +++++++
 fs/fuse/inode.c           |   9 +++-
 fs/fuse/passthrough.c     | 110 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |   4 +-
 8 files changed, 164 insertions(+), 8 deletions(-)
 create mode 100644 fs/fuse/passthrough.c

diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3e8cebfb59b7..6971454a2bdf 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_CUSE) += cuse.o
 obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
 
 fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
+fuse-objs += passthrough.o
 virtiofs-y += virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 02b3c36b3676..c2131301eeba 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -506,6 +506,7 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct fuse_args *args)
 		BUG_ON(args->out_numargs == 0);
 		ret = args->out_args[args->out_numargs - 1].size;
 	}
+	args->passthrough_filp = req->passthrough_filp;
 	fuse_put_request(fc, req);
 
 	return ret;
@@ -1897,6 +1898,8 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
 		err = copy_out_args(cs, req->args, nbytes);
 	fuse_copy_finish(cs);
 
+	fuse_setup_passthrough(fc, req);
+
 	spin_lock(&fpq->lock);
 	clear_bit(FR_LOCKED, &req->flags);
 	if (!fpq->connected)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 26f028bc760b..531de0c5c9e8 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -477,6 +477,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
 	args.out_args[0].value = &outentry;
 	args.out_args[1].size = sizeof(outopen);
 	args.out_args[1].value = &outopen;
+	args.passthrough_filp = NULL;
 	err = fuse_simple_request(fc, &args);
 	if (err)
 		goto out_free_ff;
@@ -489,6 +490,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
 	ff->fh = outopen.fh;
 	ff->nodeid = outentry.nodeid;
 	ff->open_flags = outopen.open_flags;
+	ff->passthrough_filp = args.passthrough_filp;
 	inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
 			  &outentry.attr, entry_attr_timeout(&outentry), 0);
 	if (!inode) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 83d917f7e542..c3289ff0cd33 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -33,10 +33,12 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
 }
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
-			  int opcode, struct fuse_open_out *outargp)
+			  int opcode, struct fuse_open_out *outargp,
+			  struct file **passthrough_filp)
 {
 	struct fuse_open_in inarg;
 	FUSE_ARGS(args);
+	int ret;
 
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
@@ -51,7 +53,10 @@ static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 	args.out_args[0].size = sizeof(*outargp);
 	args.out_args[0].value = outargp;
 
-	return fuse_simple_request(fc, &args);
+	ret = fuse_simple_request(fc, &args);
+	*passthrough_filp = args.passthrough_filp;
+
+	return ret;
 }
 
 struct fuse_release_args {
@@ -144,14 +149,16 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 	/* Default for no-open */
 	ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
 	if (isdir ? !fc->no_opendir : !fc->no_open) {
+		struct file *passthrough_filp;
 		struct fuse_open_out outarg;
 		int err;
 
-		err = fuse_send_open(fc, nodeid, file, opcode, &outarg);
+		err = fuse_send_open(fc, nodeid, file, opcode, &outarg,
+				     &passthrough_filp);
 		if (!err) {
 			ff->fh = outarg.fh;
 			ff->open_flags = outarg.open_flags;
-
+			ff->passthrough_filp = passthrough_filp;
 		} else if (err != -ENOSYS) {
 			fuse_file_free(ff);
 			return err;
@@ -281,6 +288,8 @@ void fuse_release_common(struct file *file, bool isdir)
 	struct fuse_release_args *ra = ff->release_args;
 	int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
 
+	fuse_passthrough_release(ff);
+
 	fuse_prepare_release(fi, ff, file->f_flags, opcode);
 
 	if (ff->flock) {
@@ -1543,7 +1552,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
+	if (ff->passthrough_filp)
+		return fuse_passthrough_read_iter(iocb, to);
+	else if (!(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_cache_read_iter(iocb, to);
 	else
 		return fuse_direct_read_iter(iocb, to);
@@ -1557,7 +1568,9 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
+	if (ff->passthrough_filp)
+		return fuse_passthrough_write_iter(iocb, from);
+	else if (!(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_cache_write_iter(iocb, from);
 	else
 		return fuse_direct_write_iter(iocb, from);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..5d6f2f913c71 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -208,6 +208,12 @@ struct fuse_file {
 
 	} readdir;
 
+	/**
+	 * Reference to lower filesystem file for read/write operations
+	 * handled in pass-through mode
+	 */
+	struct file *passthrough_filp;
+
 	/** RB node to be linked on fuse_conn->polled_files */
 	struct rb_node polled_node;
 
@@ -252,6 +258,7 @@ struct fuse_args {
 	bool may_block:1;
 	struct fuse_in_arg in_args[3];
 	struct fuse_arg out_args[2];
+	struct file *passthrough_filp;
 	void (*end)(struct fuse_conn *fc, struct fuse_args *args, int error);
 };
 
@@ -353,6 +360,9 @@ struct fuse_req {
 		struct fuse_out_header h;
 	} out;
 
+	/** Lower filesystem file pointer used in pass-through mode */
+	struct file *passthrough_filp;
+
 	/** Used to wake up the task waiting for completion of request*/
 	wait_queue_head_t waitq;
 
@@ -720,6 +730,9 @@ struct fuse_conn {
 	/* Do not show mount options */
 	unsigned int no_mount_options:1;
 
+	/** Pass-through mode for read/write IO */
+	unsigned int passthrough:1;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
@@ -1093,4 +1106,9 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
 void fuse_free_conn(struct fuse_conn *fc);
 
+void fuse_setup_passthrough(struct fuse_conn *fc, struct fuse_req *req);
+ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
+void fuse_passthrough_release(struct fuse_file *ff);
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..eb223130a917 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -965,6 +965,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
 					min_t(unsigned int, FUSE_MAX_MAX_PAGES,
 					max_t(unsigned int, arg->max_pages, 1));
 			}
+			if (arg->flags & FUSE_PASSTHROUGH) {
+				fc->passthrough = 1;
+				/* Prevent further stacking */
+				fc->sb->s_stack_depth =
+					FILESYSTEM_MAX_STACK_DEPTH;
+			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1002,7 +1008,8 @@ void fuse_send_init(struct fuse_conn *fc)
 		FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
 		FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
 		FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
-		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+		FUSE_PASSTHROUGH;
 	ia->args.opcode = FUSE_INIT;
 	ia->args.in_numargs = 1;
 	ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
new file mode 100644
index 000000000000..02786a55c3ab
--- /dev/null
+++ b/fs/fuse/passthrough.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "fuse_i.h"
+
+#include <linux/aio.h>
+#include <linux/fs_stack.h>
+
+void fuse_setup_passthrough(struct fuse_conn *fc, struct fuse_req *req)
+{
+	struct super_block *passthrough_sb;
+	struct inode *passthrough_inode;
+	struct fuse_open_out *open_out;
+	struct file *passthrough_filp;
+	unsigned short open_out_index;
+	int fs_stack_depth;
+
+	req->passthrough_filp = NULL;
+
+	if (!fc->passthrough)
+		return;
+
+	if (!(req->in.h.opcode == FUSE_OPEN && req->args->out_numargs == 1) &&
+	    !(req->in.h.opcode == FUSE_CREATE && req->args->out_numargs == 2))
+		return;
+
+	open_out_index = req->args->out_numargs - 1;
+
+	if (req->args->out_args[open_out_index].size != sizeof(*open_out))
+		return;
+
+	open_out = req->args->out_args[open_out_index].value;
+
+	if (!(open_out->open_flags & FOPEN_PASSTHROUGH))
+		return;
+
+	if (open_out->fd < 0)
+		return;
+
+	passthrough_filp = fget_raw(open_out->fd);
+	if (!passthrough_filp)
+		return;
+
+	passthrough_inode = file_inode(passthrough_filp);
+	passthrough_sb = passthrough_inode->i_sb;
+	fs_stack_depth = passthrough_sb->s_stack_depth + 1;
+	if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+		fput(passthrough_filp);
+		return;
+	}
+
+	req->passthrough_filp = passthrough_filp;
+}
+
+static inline ssize_t fuse_passthrough_read_write_iter(struct kiocb *iocb,
+						       struct iov_iter *iter,
+						       bool write)
+{
+	struct file *fuse_filp = iocb->ki_filp;
+	struct fuse_file *ff = fuse_filp->private_data;
+	struct file *passthrough_filp = ff->passthrough_filp;
+	struct inode *passthrough_inode;
+	struct inode *fuse_inode;
+	ssize_t ret = -EIO;
+
+	fuse_inode = fuse_filp->f_path.dentry->d_inode;
+	passthrough_inode = file_inode(passthrough_filp);
+
+	iocb->ki_filp = passthrough_filp;
+
+	if (write) {
+		if (!passthrough_filp->f_op->write_iter)
+			goto out;
+
+		ret = call_write_iter(passthrough_filp, iocb, iter);
+		if (ret >= 0 || ret == -EIOCBQUEUED) {
+			fsstack_copy_inode_size(fuse_inode, passthrough_inode);
+			fsstack_copy_attr_times(fuse_inode, passthrough_inode);
+		}
+	} else {
+		if (!passthrough_filp->f_op->read_iter)
+			goto out;
+
+		ret = call_read_iter(passthrough_filp, iocb, iter);
+		if (ret >= 0 || ret == -EIOCBQUEUED)
+			fsstack_copy_attr_atime(fuse_inode, passthrough_inode);
+	}
+
+out:
+	iocb->ki_filp = fuse_filp;
+
+	return ret;
+}
+
+ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	return fuse_passthrough_read_write_iter(iocb, to, false);
+}
+
+ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	return fuse_passthrough_read_write_iter(iocb, from, true);
+}
+
+void fuse_passthrough_release(struct fuse_file *ff)
+{
+	if (ff->passthrough_filp) {
+		fput(ff->passthrough_filp);
+		ff->passthrough_filp = NULL;
+	}
+}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..e50bd775210a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -283,6 +283,7 @@ struct fuse_file_lock {
 #define FOPEN_NONSEEKABLE	(1 << 2)
 #define FOPEN_CACHE_DIR		(1 << 3)
 #define FOPEN_STREAM		(1 << 4)
+#define FOPEN_PASSTHROUGH	(1 << 5)
 
 /**
  * INIT request/reply flags
@@ -342,6 +343,7 @@ struct fuse_file_lock {
 #define FUSE_NO_OPENDIR_SUPPORT (1 << 24)
 #define FUSE_EXPLICIT_INVAL_DATA (1 << 25)
 #define FUSE_MAP_ALIGNMENT	(1 << 26)
+#define FUSE_PASSTHROUGH	(1 << 27)
 
 /**
  * CUSE INIT request/reply flags
@@ -591,7 +593,7 @@ struct fuse_create_in {
 struct fuse_open_out {
 	uint64_t	fh;
 	uint32_t	open_flags;
-	uint32_t	padding;
+	int32_t		fd;
 };
 
 struct fuse_release_in {
-- 
2.28.0.236.gb10cc79966-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-12 16:14 [PATCH v6] fuse: Add support for passthrough read/write Alessio Balsini
@ 2020-08-12 18:29 ` Jann Horn
  2020-08-13 13:28   ` Alessio Balsini
  2020-08-13 18:41   ` Jens Axboe
  2020-08-12 22:06 ` kernel test robot
  1 sibling, 2 replies; 13+ messages in thread
From: Jann Horn @ 2020-08-12 18:29 UTC (permalink / raw)
  To: Alessio Balsini, Jens Axboe
  Cc: Miklos Szeredi, Nikhilesh Reddy, Akilesh Kailash, David Anderson,
	Eric Yan, Martijn Coenen, Palmer Dabbelt, Paul Lawrence,
	Stefano Duo, Zimuzo Ezeozue, kernel-team, linux-fsdevel,
	kernel list

[+Jens: can you have a look at that ->ki_filp switcheroo in
fuse_passthrough_read_write_iter() and help figure out whether that's
fine? This seems like your area of expertise.]

On Wed, Aug 12, 2020 at 6:15 PM Alessio Balsini <balsini@android.com> wrote:
> Add support for filesystem passthrough read/write of files when enabled in
> userspace through the option FUSE_PASSTHROUGH.

Oh, I remember this old series... v5 was from 2016? Nice to see
someone picking this up again, I liked the idea quite a bit.

> There are filesystems based on FUSE that are intended to enforce special
> policies or trigger complicate decision makings at the file operations
> level.  Android, for example, uses FUSE to enforce fine-grained access
> policies that also depend on the file contents.
> Sometimes it happens that at open or create time a file is identified as
> not requiring additional checks for consequent reads/writes, thus FUSE
> would simply act as a passive bridge between the process accessing the FUSE
> filesystem and the lower filesystem. Splicing and caching help reducing the
> FUSE overhead, but there are still read/write operations forwarded to the
> userspace FUSE daemon that could be avoided.
>
> When the FUSE_PASSTHROUGH capability is enabled the FUSE daemon may decide
> while handling the open or create operations if the given file can be
> accessed in passthrough mode, meaning that all the further read and write
> operations would be forwarded by the kernel directly to the lower
> filesystem rather than to the FUSE daemon. All requests that are not reads
> or writes are still handled by the userspace FUSE daemon.
> This allows for improved performance on reads and writes. Benchmarks show
> improved performance that is close to native filesystem access when doing
> massive manipulations on a single opened file, especially in the case of
> random reads, for which the bandwidth increased by almost 2X or sequential
> writes for which the improvement is close to 3X.
>
> The creation of this direct connection (passthrough) between FUSE file
> objects and file objects in the lower filesystem happens in a way that
> reminds of passing file descriptors via sockets:
> - a process opens a file handled by FUSE, so the kernel forwards the
>   request to the FUSE daemon;
> - the FUSE daemon opens the target file in the lower filesystem, getting
>   its file descriptor;
> - the file descriptor is passed to the kernel via /dev/fuse;
> - the kernel gets the file pointer navigating through the opened files of
>   the "current" process and stores it in an additional field in the
>   fuse_file owned by the process accessing the FUSE filesystem.

Unfortunately, this step isn't really compatible with how the current
FUSE API works. Essentially, the current FUSE API is reached via
operations like the write() syscall, and reaches FUSE through either
->write_iter or ->splice_write in fuse_dev_operations. In that
context, operations like fget_raw() that operate on the calling
process are unsafe.

The reason why operations like fget() are unsafe in this context is
that the security model of Linux fundamentally assumes that if you get
a file descriptor from an untrusted process, and you write stuff into
it, anything that happens will be limited to things that the process
that gave you the file descriptor would've been able to do on its own.
Or in other words, an attacker shouldn't be able to gain anything by
convincing a privileged process to write attacker-controlled data into
an attacker-supplied file descriptor. With the current design, an
attacker may be able to trick a privileged process into installing one
of its FDs as a passthrough FD into an attacker-controlled FUSE
instance (while the privileged process thinks that it's just writing
some opaque data into some file), and thereby make it possible for an
attacker to indirectly gain the ability to read/write that FD.

The only way I see to fix this somewhat cleanly would be to add a new
command to fuse_dev_ioctl() that can be used to submit replies as an
alternative to the write()-based interface. (That should probably be a
separate patch in this patch series.) Then, you could have a flag
argument to fuse_dev_do_write() that tells it whether the ioctl
interface was used, and use that information to decide whether
fuse_setup_passthrough() is allowed.
(An alternative approach would be to require userspace to set some
flag on the write operation that says "I am intentionally performing
an operation that depends on caller identity" and pass that through -
e.g. by using pwritev2()'s flags argument. But I think historically
the stance has been that stuff like write() simply should not be
looking at the calling process.)

> From now all the read/write operations performed by that process will be
> redirected to the file pointer pointing at the lower filesystem's file.
[...]
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
[...]
> +void fuse_setup_passthrough(struct fuse_conn *fc, struct fuse_req *req)
> +{
> +       struct super_block *passthrough_sb;
> +       struct inode *passthrough_inode;
> +       struct fuse_open_out *open_out;
> +       struct file *passthrough_filp;
> +       unsigned short open_out_index;
> +       int fs_stack_depth;
> +
> +       req->passthrough_filp = NULL;
> +
> +       if (!fc->passthrough)
> +               return;
> +
> +       if (!(req->in.h.opcode == FUSE_OPEN && req->args->out_numargs == 1) &&
> +           !(req->in.h.opcode == FUSE_CREATE && req->args->out_numargs == 2))
> +               return;
> +
> +       open_out_index = req->args->out_numargs - 1;
> +
> +       if (req->args->out_args[open_out_index].size != sizeof(*open_out))
> +               return;
> +
> +       open_out = req->args->out_args[open_out_index].value;
> +
> +       if (!(open_out->open_flags & FOPEN_PASSTHROUGH))
> +               return;
> +
> +       if (open_out->fd < 0)
> +               return;

What is the intent here? fget() will fail anyway if the fd is invalid.

> +       passthrough_filp = fget_raw(open_out->fd);

This should probably be a normal fget()? fget_raw() is just necessary
if you want to permit using O_PATH file descriptors, and read/write
operations don't work on those.

> +       if (!passthrough_filp)
> +               return;

This error path can only be reached if the caller supplied invalid
data. IMO this should bail out with an error.

> +       passthrough_inode = file_inode(passthrough_filp);
> +       passthrough_sb = passthrough_inode->i_sb;
> +       fs_stack_depth = passthrough_sb->s_stack_depth + 1;
> +       if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> +               fput(passthrough_filp);
> +               return;
> +       }

This is an error path that silently ignores the error and continues to
open the file normally as if FOPEN_PASSTHROUGH hadn't been set. Is
this an intentional fallback? If so, maybe you could add a comment on
top of fuse_setup_passthrough() like: "If setting up passthrough fails
due to incompatibility, we ignore the error and continue setting up
the file as normal."

> +       req->passthrough_filp = passthrough_filp;
> +}
> +
> +static inline ssize_t fuse_passthrough_read_write_iter(struct kiocb *iocb,
> +                                                      struct iov_iter *iter,
> +                                                      bool write)
> +{
> +       struct file *fuse_filp = iocb->ki_filp;
> +       struct fuse_file *ff = fuse_filp->private_data;
> +       struct file *passthrough_filp = ff->passthrough_filp;
> +       struct inode *passthrough_inode;
> +       struct inode *fuse_inode;
> +       ssize_t ret = -EIO;
> +
> +       fuse_inode = fuse_filp->f_path.dentry->d_inode;

I think this could just use file_inode(fuse_filp)?


> +       passthrough_inode = file_inode(passthrough_filp);
> +
> +       iocb->ki_filp = passthrough_filp;

Hmm... so we're temporarily switching out the iocb's ->ki_filp here? I
wonder whether it is possible for some other code to look at ->ki_filp
concurrently... maybe Jens Axboe knows whether that could plausibly
happen?

Second question about this switcheroo below...

> +       if (write) {
> +               if (!passthrough_filp->f_op->write_iter)
> +                       goto out;
> +
> +               ret = call_write_iter(passthrough_filp, iocb, iter);

Hmm, I don't think we can just invoke
call_write_iter()/call_read_iter() like this. If you look at something
like vfs_writev(), you can see that normally, there are a bunch of
other things that happen:

 - file_start_write() before the write
 - check whether the file's ->f_mode permits writing with FMODE_WRITE
and FMODE_CAN_WRITE
 - rw_verify_area() for stuff like mandatory locking and LSM security
checks (although admittedly this LSM security check is pretty useless)
 - fsnotify_modify() to trigger inotify watches and such that notify
userspace of file modifications
 - file_end_write() after the write

You should probably try to use vfs_iocb_iter_write() here, and figure
out how to properly add file_start_write()/file_end_write() calls
around this. Perhaps ovl_write_iter() from fs/overlayfs/file.c can
help with this? Note that you can't always just call file_end_write()
synchronously, since the request may complete asynchronously.

> +               if (ret >= 0 || ret == -EIOCBQUEUED) {
> +                       fsstack_copy_inode_size(fuse_inode, passthrough_inode);
> +                       fsstack_copy_attr_times(fuse_inode, passthrough_inode);
> +               }
> +       } else {
> +               if (!passthrough_filp->f_op->read_iter)
> +                       goto out;
> +
> +               ret = call_read_iter(passthrough_filp, iocb, iter);
> +               if (ret >= 0 || ret == -EIOCBQUEUED)
> +                       fsstack_copy_attr_atime(fuse_inode, passthrough_inode);
> +       }
> +
> +out:
> +       iocb->ki_filp = fuse_filp;

Also a question that I hope Jens can help with: If this is an
asynchronous request, would something bad happen if the request
completes before we reach that last ->ki_filp write (if that is even
possible)? Or could an asynchronous request blow up because this
->ki_filp write happens before the request has completed?

> +       return ret;
> +}

Overall, I wonder whether it would be better to copy overlayfs'
approach of creating a new iocb instead of trying to switch out parts
of the existing iocb (see ovl_write_iter()). That would simplify this
weirdness a lot, at the cost of having to allocate slab memory to
store the copied iocb.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-12 16:14 [PATCH v6] fuse: Add support for passthrough read/write Alessio Balsini
  2020-08-12 18:29 ` Jann Horn
@ 2020-08-12 22:06 ` kernel test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2020-08-12 22:06 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2848 bytes --]

Hi Alessio,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on fuse/for-next]
[also build test WARNING on v5.8 next-20200812]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Alessio-Balsini/fuse-Add-support-for-passthrough-read-write/20200813-001709
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
compiler: h8300-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


cppcheck warnings: (new ones prefixed by >>)

>> fs/fuse/passthrough.c:74:7: warning: Variable 'ret' is reassigned a value before the old one has been used. [redundantAssignment]
     ret = call_write_iter(passthrough_filp, iocb, iter);
         ^
   fs/fuse/passthrough.c:63:0: note: Variable 'ret' is reassigned a value before the old one has been used.
    ssize_t ret = -EIO;
   ^
   fs/fuse/passthrough.c:74:7: note: Variable 'ret' is reassigned a value before the old one has been used.
     ret = call_write_iter(passthrough_filp, iocb, iter);
         ^

vim +/ret +74 fs/fuse/passthrough.c

    53	
    54	static inline ssize_t fuse_passthrough_read_write_iter(struct kiocb *iocb,
    55							       struct iov_iter *iter,
    56							       bool write)
    57	{
    58		struct file *fuse_filp = iocb->ki_filp;
    59		struct fuse_file *ff = fuse_filp->private_data;
    60		struct file *passthrough_filp = ff->passthrough_filp;
    61		struct inode *passthrough_inode;
    62		struct inode *fuse_inode;
    63		ssize_t ret = -EIO;
    64	
    65		fuse_inode = fuse_filp->f_path.dentry->d_inode;
    66		passthrough_inode = file_inode(passthrough_filp);
    67	
    68		iocb->ki_filp = passthrough_filp;
    69	
    70		if (write) {
    71			if (!passthrough_filp->f_op->write_iter)
    72				goto out;
    73	
  > 74			ret = call_write_iter(passthrough_filp, iocb, iter);
    75			if (ret >= 0 || ret == -EIOCBQUEUED) {
    76				fsstack_copy_inode_size(fuse_inode, passthrough_inode);
    77				fsstack_copy_attr_times(fuse_inode, passthrough_inode);
    78			}
    79		} else {
    80			if (!passthrough_filp->f_op->read_iter)
    81				goto out;
    82	
    83			ret = call_read_iter(passthrough_filp, iocb, iter);
    84			if (ret >= 0 || ret == -EIOCBQUEUED)
    85				fsstack_copy_attr_atime(fuse_inode, passthrough_inode);
    86		}
    87	
    88	out:
    89		iocb->ki_filp = fuse_filp;
    90	
    91		return ret;
    92	}
    93	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-12 18:29 ` Jann Horn
@ 2020-08-13 13:28   ` Alessio Balsini
  2020-08-13 18:30     ` Jann Horn
  2020-08-13 18:41   ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Alessio Balsini @ 2020-08-13 13:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jens Axboe, Miklos Szeredi, Nikhilesh Reddy, Akilesh Kailash,
	David Anderson, Eric Yan, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, kernel-team,
	linux-fsdevel, kernel list

Hi Jann,

Thank you for looking into this.

On Wed, Aug 12, 2020 at 08:29:58PM +0200, 'Jann Horn' via kernel-team wrote:
> [+Jens: can you have a look at that ->ki_filp switcheroo in
> fuse_passthrough_read_write_iter() and help figure out whether that's
> fine? This seems like your area of expertise.]
> 
> On Wed, Aug 12, 2020 at 6:15 PM Alessio Balsini <balsini@android.com> wrote:
> > Add support for filesystem passthrough read/write of files when enabled in
> > userspace through the option FUSE_PASSTHROUGH.
> 
> Oh, I remember this old series... v5 was from 2016? Nice to see
> someone picking this up again, I liked the idea quite a bit.
> 

Yes, it's been a while since the last iteration.
The idea of the original series is as interesting as simple and I think is
a good tradeoff between the flexibility of FUSE and the speed performance
of SDCardFS, that Android has been using for a few years.

> [...]
> Unfortunately, this step isn't really compatible with how the current
> FUSE API works. Essentially, the current FUSE API is reached via
> operations like the write() syscall, and reaches FUSE through either
> ->write_iter or ->splice_write in fuse_dev_operations. In that
> context, operations like fget_raw() that operate on the calling
> process are unsafe.
> 
> The reason why operations like fget() are unsafe in this context is
> that the security model of Linux fundamentally assumes that if you get
> a file descriptor from an untrusted process, and you write stuff into
> it, anything that happens will be limited to things that the process
> that gave you the file descriptor would've been able to do on its own.
> Or in other words, an attacker shouldn't be able to gain anything by
> convincing a privileged process to write attacker-controlled data into
> an attacker-supplied file descriptor. With the current design, an
> attacker may be able to trick a privileged process into installing one
> of its FDs as a passthrough FD into an attacker-controlled FUSE
> instance (while the privileged process thinks that it's just writing
> some opaque data into some file), and thereby make it possible for an
> attacker to indirectly gain the ability to read/write that FD.
> 

This is a great explanation.

I've been thinking about this before posting the patch and my final thought
was that being the FUSE daemon already responsible for handling file ops
coming from untrusted processes, and the privileges of these ops are anyway
escalated to the daemon's, if the FUSE daemon has vulnerabilities to
exploit, there's not much we can do to avoid an attacker to mess with files
at the same permission level as the FUSE daemon. And this is true also
without this patch, right?
IOW, the feature introduced here is something that I agree should be
handled with care, but is there something that can go wrong if the FUSE
daemon is written properly? If we cannot trust the FUSE daemon, then we
should also not trust what it would be able to access, so isn't it enough
to prove that an attacker wouldn't be able to get more privileges than the
FUSE daemon? Sorry if I missed something.

> The only way I see to fix this somewhat cleanly would be to add a new
> command to fuse_dev_ioctl() that can be used to submit replies as an
> alternative to the write()-based interface. (That should probably be a
> separate patch in this patch series.) Then, you could have a flag
> argument to fuse_dev_do_write() that tells it whether the ioctl
> interface was used, and use that information to decide whether
> fuse_setup_passthrough() is allowed.
> (An alternative approach would be to require userspace to set some
> flag on the write operation that says "I am intentionally performing
> an operation that depends on caller identity" and pass that through -
> e.g. by using pwritev2()'s flags argument. But I think historically
> the stance has been that stuff like write() simply should not be
> looking at the calling process.)
> 

I'm not sure I got it right. Could you please elaborate on what is the
purpose of the new fuse_dev_ioctl() command?
Do you mean letting the kernel decide whether a FUSE daemon is allowed to
run fuse_setup_passthrough() or to decide if passthrough should be allowed
on a specific file?
In general, I prefer to avoid as much as I can API changes, let's see if
there is a way to leave them untouched. :)
What do you think about adding some extra checkings in
fuse_setup_passthrough() to let the kernel decide if passthrough is doable?
Do you think this would make things better regarding what you mentioned?

> > +       if (open_out->fd < 0)
> > +               return;
> 
> What is the intent here? fget() will fail anyway if the fd is invalid.

Right, I think I forgot this extra check when I was still unsure about how
to communicate from the FUSE daemon that something was wrong with the given
fd, but you are right, I'll remove this useless check.

> > +       passthrough_filp = fget_raw(open_out->fd);
> 
> This should probably be a normal fget()? fget_raw() is just necessary
> if you want to permit using O_PATH file descriptors, and read/write
> operations don't work on those.
> 

Fair enough. Testing and adding to v7.

> > +       if (!passthrough_filp)
> > +               return;
> 
> This error path can only be reached if the caller supplied invalid
> data. IMO this should bail out with an error.

As you can see I switched from the BUG_ON() approach of the previous patch
to the extreme opposite of transparent, graceful error handling.
Do you think we should abort the whole open operation, or adding a few
warning messages may suffice?

> 
> > +       passthrough_inode = file_inode(passthrough_filp);
> > +       passthrough_sb = passthrough_inode->i_sb;
> > +       fs_stack_depth = passthrough_sb->s_stack_depth + 1;
> > +       if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > +               fput(passthrough_filp);
> > +               return;
> > +       }
> 
> This is an error path that silently ignores the error and continues to
> open the file normally as if FOPEN_PASSTHROUGH hadn't been set. Is
> this an intentional fallback? If so, maybe you could add a comment on
> top of fuse_setup_passthrough() like: "If setting up passthrough fails
> due to incompatibility, we ignore the error and continue setting up
> the file as normal."

This is intentional behavior, but I'll be glad to add _at least_ a comment,
or as before, a warning message.

> 
> > +       req->passthrough_filp = passthrough_filp;
> > +}
> > +
> > +static inline ssize_t fuse_passthrough_read_write_iter(struct kiocb *iocb,
> > +                                                      struct iov_iter *iter,
> > +                                                      bool write)
> > +{
> > +       struct file *fuse_filp = iocb->ki_filp;
> > +       struct fuse_file *ff = fuse_filp->private_data;
> > +       struct file *passthrough_filp = ff->passthrough_filp;
> > +       struct inode *passthrough_inode;
> > +       struct inode *fuse_inode;
> > +       ssize_t ret = -EIO;
> > +
> > +       fuse_inode = fuse_filp->f_path.dentry->d_inode;
> 
> I think this could just use file_inode(fuse_filp)?

Nice catch! I think I lost this bit in a rebase, thanks!

> 
> 
> > +       passthrough_inode = file_inode(passthrough_filp);
> > +
> > +       iocb->ki_filp = passthrough_filp;
> 
> Hmm... so we're temporarily switching out the iocb's ->ki_filp here? I
> wonder whether it is possible for some other code to look at ->ki_filp
> concurrently... maybe Jens Axboe knows whether that could plausibly
> happen?
> 
> Second question about this switcheroo below...
> 
> > +       if (write) {
> > +               if (!passthrough_filp->f_op->write_iter)
> > +                       goto out;
> > +
> > +               ret = call_write_iter(passthrough_filp, iocb, iter);
> 
> Hmm, I don't think we can just invoke
> call_write_iter()/call_read_iter() like this. If you look at something
> like vfs_writev(), you can see that normally, there are a bunch of
> other things that happen:
> 
>  - file_start_write() before the write
>  - check whether the file's ->f_mode permits writing with FMODE_WRITE
> and FMODE_CAN_WRITE
>  - rw_verify_area() for stuff like mandatory locking and LSM security
> checks (although admittedly this LSM security check is pretty useless)
>  - fsnotify_modify() to trigger inotify watches and such that notify
> userspace of file modifications
>  - file_end_write() after the write
> 
> You should probably try to use vfs_iocb_iter_write() here, and figure
> out how to properly add file_start_write()/file_end_write() calls
> around this. Perhaps ovl_write_iter() from fs/overlayfs/file.c can
> help with this? Note that you can't always just call file_end_write()
> synchronously, since the request may complete asynchronously.

Answering here both the two previous questions, that are strictly related.
I couldn't find any racy path for a specific iocp->ki_filp. Glad to be
proven wrong, let's see if Jens can bring some light here.

Jumping back to vfs with call_{read,write}_iter(), this looked to me as the
most elegant solution, and I find it handy that they also perform extra
checks on the target file. I was worried at the beginning about this vfs
nesting weirdness, but the nesting unrolls just fine, and with visual
inspection I couldn't spot any dangerous situations.
Are you just worried about write operations or reads as well? Maybe Jens
can give some extra advice here too.

> > +out:
> > +       iocb->ki_filp = fuse_filp;
> 
> Also a question that I hope Jens can help with: If this is an
> asynchronous request, would something bad happen if the request
> completes before we reach that last ->ki_filp write (if that is even
> possible)? Or could an asynchronous request blow up because this
> ->ki_filp write happens before the request has completed?

I cannot think of a scenario where we don't complete the request before
restoring the original ki_filp. Jens again to prove me wrong.

> Overall, I wonder whether it would be better to copy overlayfs'
> approach of creating a new iocb instead of trying to switch out parts
> of the existing iocb (see ovl_write_iter()). That would simplify this
> weirdness a lot, at the cost of having to allocate slab memory to
> store the copied iocb.

I'm not familiar with the internals of overlayfs, I will surely give it a
look, seeking for ideas.

Thanks a lot for all the hints!

Cheers,
Alessio


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-13 13:28   ` Alessio Balsini
@ 2020-08-13 18:30     ` Jann Horn
  2020-08-18 13:53       ` Alessio Balsini
  0 siblings, 1 reply; 13+ messages in thread
From: Jann Horn @ 2020-08-13 18:30 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Jens Axboe, Miklos Szeredi, Nikhilesh Reddy, Akilesh Kailash,
	David Anderson, Eric Yan, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, kernel-team,
	linux-fsdevel, kernel list

On Thu, Aug 13, 2020 at 3:28 PM Alessio Balsini <balsini@android.com> wrote:
> On Wed, Aug 12, 2020 at 08:29:58PM +0200, 'Jann Horn' via kernel-team wrote:
[...]
> > On Wed, Aug 12, 2020 at 6:15 PM Alessio Balsini <balsini@android.com> wrote:
> > > Add support for filesystem passthrough read/write of files when enabled in
> > > userspace through the option FUSE_PASSTHROUGH.
[...]
> > Unfortunately, this step isn't really compatible with how the current
> > FUSE API works. Essentially, the current FUSE API is reached via
> > operations like the write() syscall, and reaches FUSE through either
> > ->write_iter or ->splice_write in fuse_dev_operations. In that
> > context, operations like fget_raw() that operate on the calling
> > process are unsafe.
> >
> > The reason why operations like fget() are unsafe in this context is
> > that the security model of Linux fundamentally assumes that if you get
> > a file descriptor from an untrusted process, and you write stuff into
> > it, anything that happens will be limited to things that the process
> > that gave you the file descriptor would've been able to do on its own.
> > Or in other words, an attacker shouldn't be able to gain anything by
> > convincing a privileged process to write attacker-controlled data into
> > an attacker-supplied file descriptor. With the current design, an
> > attacker may be able to trick a privileged process into installing one
> > of its FDs as a passthrough FD into an attacker-controlled FUSE
> > instance (while the privileged process thinks that it's just writing
> > some opaque data into some file), and thereby make it possible for an
> > attacker to indirectly gain the ability to read/write that FD.
> >
>
> This is a great explanation.
>
> I've been thinking about this before posting the patch and my final thought
> was that being the FUSE daemon already responsible for handling file ops
> coming from untrusted processes, and the privileges of these ops are anyway
> escalated to the daemon's, if the FUSE daemon has vulnerabilities to
> exploit,

This is not about vulnerabilities in the FUSE daemon.

> there's not much we can do to avoid an attacker to mess with files
> at the same permission level as the FUSE daemon. And this is true also
> without this patch, right?
> IOW, the feature introduced here is something that I agree should be
> handled with care, but is there something that can go wrong if the FUSE
> daemon is written properly? If we cannot trust the FUSE daemon, then we
> should also not trust what it would be able to access, so isn't it enough
> to prove that an attacker wouldn't be able to get more privileges than the
> FUSE daemon? Sorry if I missed something.

My point is that you can use this security issue to steal file
descriptors from processes that are _not_ supposed to act as FUSE
daemons. For example, let's say there is some setuid root executable
that opens /etc/shadow and then writes a message to stdout. If I
invoke this setuid root executable with stdout pointing to a FUSE
device fd, and I can arrange for the message written by the setuid
process to look like a FUSE message, I can trick the setuid process to
install its /etc/shadow fd as a passthrough fd on a FUSE file, and
then, through the FUSE filesystem, mess with /etc/shadow.

Also, for context: While on Android, access to /dev/fuse is
restricted, desktop Linux systems allow unprivileged users to use
FUSE.

> > The only way I see to fix this somewhat cleanly would be to add a new
> > command to fuse_dev_ioctl() that can be used to submit replies as an
> > alternative to the write()-based interface. (That should probably be a
> > separate patch in this patch series.) Then, you could have a flag
> > argument to fuse_dev_do_write() that tells it whether the ioctl
> > interface was used, and use that information to decide whether
> > fuse_setup_passthrough() is allowed.
> > (An alternative approach would be to require userspace to set some
> > flag on the write operation that says "I am intentionally performing
> > an operation that depends on caller identity" and pass that through -
> > e.g. by using pwritev2()'s flags argument. But I think historically
> > the stance has been that stuff like write() simply should not be
> > looking at the calling process.)
> >
>
> I'm not sure I got it right. Could you please elaborate on what is the
> purpose of the new fuse_dev_ioctl() command?
> Do you mean letting the kernel decide whether a FUSE daemon is allowed to
> run fuse_setup_passthrough() or to decide if passthrough should be allowed
> on a specific file?

The new fuse_dev_ioctl() command would behave just like
fuse_dev_write(), except that the ioctl-based interface would permit
OPEN/CREATE replies with FOPEN_PASSTHROUGH, while the write()-based
interface would reject them.

> In general, I prefer to avoid as much as I can API changes, let's see if
> there is a way to leave them untouched. :)
> What do you think about adding some extra checkings in
> fuse_setup_passthrough() to let the kernel decide if passthrough is doable?
> Do you think this would make things better regarding what you mentioned?

I can't think of any checks that you could do there to make it safe,
because fundamentally what you have to figure out is _userspace's
intent_, and you can't figure out what that is if userspace just calls
write().

[...]
> > > +       if (!passthrough_filp)
> > > +               return;
> >
> > This error path can only be reached if the caller supplied invalid
> > data. IMO this should bail out with an error.
>
> As you can see I switched from the BUG_ON() approach of the previous patch
> to the extreme opposite of transparent, graceful error handling.
> Do you think we should abort the whole open operation, or adding a few
> warning messages may suffice?

In this case, an error indicates that the userspace programmer made a
mistake. So even if the userspace programmer is not looking at kernel
logs, we should indicate to them that they messed up - and we can do
that by returning an error code from the syscall. So I think we should
ideally abort the operation in this case.

[...]
> > > +       passthrough_inode = file_inode(passthrough_filp);
> > > +
> > > +       iocb->ki_filp = passthrough_filp;
> >
> > Hmm... so we're temporarily switching out the iocb's ->ki_filp here? I
> > wonder whether it is possible for some other code to look at ->ki_filp
> > concurrently... maybe Jens Axboe knows whether that could plausibly
> > happen?
> >
> > Second question about this switcheroo below...
> >
> > > +       if (write) {
> > > +               if (!passthrough_filp->f_op->write_iter)
> > > +                       goto out;
> > > +
> > > +               ret = call_write_iter(passthrough_filp, iocb, iter);
> >
> > Hmm, I don't think we can just invoke
> > call_write_iter()/call_read_iter() like this. If you look at something
> > like vfs_writev(), you can see that normally, there are a bunch of
> > other things that happen:
> >
> >  - file_start_write() before the write
> >  - check whether the file's ->f_mode permits writing with FMODE_WRITE
> > and FMODE_CAN_WRITE
> >  - rw_verify_area() for stuff like mandatory locking and LSM security
> > checks (although admittedly this LSM security check is pretty useless)
> >  - fsnotify_modify() to trigger inotify watches and such that notify
> > userspace of file modifications
> >  - file_end_write() after the write
> >
> > You should probably try to use vfs_iocb_iter_write() here, and figure
> > out how to properly add file_start_write()/file_end_write() calls
> > around this. Perhaps ovl_write_iter() from fs/overlayfs/file.c can
> > help with this? Note that you can't always just call file_end_write()
> > synchronously, since the request may complete asynchronously.
>
> Answering here both the two previous questions, that are strictly related.
> I couldn't find any racy path for a specific iocp->ki_filp. Glad to be
> proven wrong, let's see if Jens can bring some light here.
>
> Jumping back to vfs with call_{read,write}_iter(), this looked to me as the
> most elegant solution, and I find it handy that they also perform extra
> checks on the target file.

No, call_write_iter() doesn't do any of the required checking and
setup and notification steps, it just calls into the file's
->write_iter handler:

static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
      struct iov_iter *iter)
{
        return file->f_op->write_iter(kio, iter);
}

Did you confuse call_write_iter() with vfs_iocb_iter_write(), or
something like that?

> I was worried at the beginning about this vfs
> nesting weirdness, but the nesting unrolls just fine, and with visual
> inspection I couldn't spot any dangerous situations.
> Are you just worried about write operations or reads as well? Maybe Jens
> can give some extra advice here too.
>
> > > +out:
> > > +       iocb->ki_filp = fuse_filp;
> >
> > Also a question that I hope Jens can help with: If this is an
> > asynchronous request, would something bad happen if the request
> > completes before we reach that last ->ki_filp write (if that is even
> > possible)? Or could an asynchronous request blow up because this
> > ->ki_filp write happens before the request has completed?
>
> I cannot think of a scenario where we don't complete the request before
> restoring the original ki_filp. Jens again to prove me wrong.

Requests can complete asynchronously. That means call_write_iter() can
return more or less immediately, while the request is processed in the
background, and at some point, a callback is invoked. That's what that
-EIOCBQUEUED return value is about. In that case, the current code
will change ->ki_filp while the request is still being handled in the
background.

I recommend looking at ovl_write_iter() in overlayfs to see how
they're dealing with this case.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-12 18:29 ` Jann Horn
  2020-08-13 13:28   ` Alessio Balsini
@ 2020-08-13 18:41   ` Jens Axboe
  1 sibling, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2020-08-13 18:41 UTC (permalink / raw)
  To: Jann Horn, Alessio Balsini
  Cc: Miklos Szeredi, Nikhilesh Reddy, Akilesh Kailash, David Anderson,
	Eric Yan, Martijn Coenen, Palmer Dabbelt, Paul Lawrence,
	Stefano Duo, Zimuzo Ezeozue, kernel-team, linux-fsdevel,
	kernel list

On 8/12/20 12:29 PM, Jann Horn wrote:
>> +       passthrough_inode = file_inode(passthrough_filp);
>> +
>> +       iocb->ki_filp = passthrough_filp;
> 
> Hmm... so we're temporarily switching out the iocb's ->ki_filp here? I
> wonder whether it is possible for some other code to look at ->ki_filp
> concurrently... maybe Jens Axboe knows whether that could plausibly
> happen?

I looked into the io_uring use case, and we're using req->file (which is
the same as kiocb->ki_filp) after submission for the polled-IO case.
That's IOCB_HIPRI, not poll(2) related. So it's not safe for that case,
but that probably isn't supported by fuse. But something to keep in
mind...

In general, kiocb->ki_filp is used for setup, and then at IO completion.
That use case appears safe, as long as the ki_filp is restored before
->ki_complete() is called.

Only other exception should be the poll handlers. They arm at setup
time, which is still fine, but re-arm if we get triggered and the file
is still not ready. I _think_ this case is still fine without having
seen all of the bits for this use case, as we haven't actually called
read/write_iter at that point on it.

But in general, I'd say it looks a bit iffy to be fiddling with ki_filp.
Maybe use a new kiocb and stack them like that instead?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-13 18:30     ` Jann Horn
@ 2020-08-18 13:53       ` Alessio Balsini
  2020-08-19  6:01         ` Nikolaus Rath
  2020-08-19 10:04         ` Jann Horn
  0 siblings, 2 replies; 13+ messages in thread
From: Alessio Balsini @ 2020-08-18 13:53 UTC (permalink / raw)
  To: Jann Horn, Jens Axboe, Miklos Szeredi
  Cc: Nikhilesh Reddy, Akilesh Kailash, David Anderson, Eric Yan,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, kernel-team, linux-fsdevel, kernel list

Thank you both for the important feedback,

I tried to consolidate all your suggestions in the new version of the
patch, shared below.

As you both recommended, that tricky ki_filp swapping has been removed,
taking overlayfs as a reference for the management of asynchronous
requests.
The V7 below went again through some testing with fio (sync and libaio
engines with several jobs in randrw), fsstress, and a bunch of kernel
builds to trigger both the sync and async paths. Gladly everything worked
as expected.

What I didn't get from overlayfs are the temporary cred switchings, as I
think that in the FUSE use case it is still the FUSE daemon that should
take care of identifying if the requesting process has the right
permissions before allowing the passthrough feature.

On Thu, Aug 13, 2020 at 08:30:21PM +0200, 'Jann Horn' via kernel-team wrote:
> On Thu, Aug 13, 2020 at 3:28 PM Alessio Balsini <balsini@android.com> wrote:
> [...]
> 
> My point is that you can use this security issue to steal file
> descriptors from processes that are _not_ supposed to act as FUSE
> daemons. For example, let's say there is some setuid root executable
> that opens /etc/shadow and then writes a message to stdout. If I
> invoke this setuid root executable with stdout pointing to a FUSE
> device fd, and I can arrange for the message written by the setuid
> process to look like a FUSE message, I can trick the setuid process to
> install its /etc/shadow fd as a passthrough fd on a FUSE file, and
> then, through the FUSE filesystem, mess with /etc/shadow.
> 
> Also, for context: While on Android, access to /dev/fuse is
> restricted, desktop Linux systems allow unprivileged users to use
> FUSE.
> [...]
> The new fuse_dev_ioctl() command would behave just like
> fuse_dev_write(), except that the ioctl-based interface would permit
> OPEN/CREATE replies with FOPEN_PASSTHROUGH, while the write()-based
> interface would reject them.
> [...]
> I can't think of any checks that you could do there to make it safe,
> because fundamentally what you have to figure out is _userspace's
> intent_, and you can't figure out what that is if userspace just calls
> write().
> [...]
> In this case, an error indicates that the userspace programmer made a
> mistake. So even if the userspace programmer is not looking at kernel
> logs, we should indicate to them that they messed up - and we can do
> that by returning an error code from the syscall. So I think we should
> ideally abort the operation in this case.

I've been thinking about the security issue you mentioned, but I'm thinking
that besides the extra Android restrictions, it shouldn't be that easy to
play with /dev/fuse, also for a privileged process in Linux.
In order for a process to successfully send commands to /dev/fuse, the
command has to match a pending request coming from the FUSE filesystem,
meaning that the same privileged process should have mounted the FUSE
filesystem, actually becoming a FUSE daemon itself.
Is this still an eventually exploitable scenario?

> 
> No, call_write_iter() doesn't do any of the required checking and
> setup and notification steps, it just calls into the file's
> ->write_iter handler:
> 
> static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
>       struct iov_iter *iter)
> {
>         return file->f_op->write_iter(kio, iter);
> }
> 
> Did you confuse call_write_iter() with vfs_iocb_iter_write(), or
> something like that?

Ops, you are right, I was referring to vfs_iocb_iter_write(), that was
actually part of my first design, but ended up simplifying in favor of
call_write_iter().

> Requests can complete asynchronously. That means call_write_iter() can
> return more or less immediately, while the request is processed in the
> background, and at some point, a callback is invoked. That's what that
> -EIOCBQUEUED return value is about. In that case, the current code
> will change ->ki_filp while the request is still being handled in the
> background.
> 
> I recommend looking at ovl_write_iter() in overlayfs to see how
> they're dealing with this case.

That was a great suggestion, I hopefully picked the right things from
overlayfs, without overlooking some security aspects.

Thanks again,
Alessio

---8<---

From 5f0e162d2bb39acf41687ca722e15e95d0721c43 Mon Sep 17 00:00:00 2001
From: Alessio Balsini <balsini@android.com>
Date: Wed, 12 Aug 2020 17:14:52 +0100
Subject: [PATCH v7] fuse: Add support for passthrough read/write

Add support for filesystem passthrough read/write of files when enabled
in userspace through the option FUSE_PASSTHROUGH.

There are filesystems based on FUSE that are intended to enforce special
policies or trigger complicate decision makings at the file operations
level.  Android, for example, uses FUSE to enforce fine-grained access
policies that also depend on the file contents.
Sometimes it happens that at open or create time a file is identified as
not requiring additional checks for consequent reads/writes, thus FUSE
would simply act as a passive bridge between the process accessing the
FUSE filesystem and the lower filesystem. Splicing and caching help
reducing the FUSE overhead, but there are still read/write operations
forwarded to the userspace FUSE daemon that could be avoided.

When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may
decide while handling the open or create operations, if the given file
can be accessed in passthrough mode, meaning that all the further read
and write operations would be forwarded by the kernel directly to the
lower filesystem rather than to the FUSE daemon. All requests that are
not reads or writes are still handled by the userspace code.
This allows for improved performance on reads and writes, especially in
the case of reads at random offsets, for which no (readahead) caching
mechanism would help.
Benchmarks show improved performance that is close to native filesystem
access when doing massive manipulations on a single opened file,
especially in the case of random reads, for which the bandwidth
increased by almost 2X or sequential writes for which the improvement is
close to 3X.

The creation of this direct connection (passthrough) between FUSE file
objects and file objects in the lower filesystem happens in a way that
reminds of passing file descriptors via sockets:
- a process opens a file handled by FUSE, so the kernel forwards the
  request to the FUSE daemon;
- the FUSE daemon opens the target file in the lower filesystem, getting
  its file descriptor;
- the file descriptor is passed to the kernel via /dev/fuse;
- the kernel gets the file pointer navigating through the opened files
  of the "current" process and stores it in an additional field in the
  fuse_file owned by the process accessing the FUSE filesystem.
From now all the read/write operations performed by that process will be
redirected to the file pointer pointing at the lower filesystem's file.
Handling asynchronous IO is done by creating separate AIO requests for
the lower filesystem that will be internally tracked by FUSE, that
intercepts and propagates their completion through an internal
ki_completed callback similar to the current implementation of
overlayfs.

Original-patch-by: Nikhilesh Reddy <reddyn@codeaurora.org>
Signed-off-by: Alessio Balsini <balsini@android.com>
--
    Performance

What follows has been performed with this change [V6] rebased on top of a
vanilla v5.8 Linux kernel, using a custom passthrough_hp FUSE daemon that
enables pass-through for each file that is opened during both “open” and
“create”. Tests were run on an Intel Xeon E5-2678V3, 32GiB of RAM, with an
ext4-formatted SSD as the lower filesystem, with no special tuning, e.g., all
the involved processes are SCHED_OTHER, ondemand is the frequency governor with
no frequency restrictions, and turbo-boost, as well as p-state, are active.
This is because I noticed that, for such high-level benchmarks, results
consistency was minimally affected by these features.
The source code of the updated libfuse library and passthrough_hp is shared at
the following repository:

    https://github.com/balsini/libfuse/tree/fuse-passthrough-stable-v.3.9.4

Two different kinds of benchmarks were done for this change, the first set of
tests evaluates the bandwidth improvements when manipulating a huge single
file, the second set of tests verify that no performance regressions were
introduced when handling many small files.

The first benchmarks were done by running FIO (fio-3.21) with:
- bs=4Ki;
- file size: 50Gi;
- ioengine: sync;
- fsync_on_close: true.
The target file has been chosen large enough to avoid it to be entirely loaded
into the page cache.
Results are presented in the following table:

+-----------+--------+-------------+--------+
| Bandwidth |  FUSE  |     FUSE    |  Bind  |
|  (KiB/s)  |        | passthrough |  mount |
+-----------+--------+-------------+--------+
| read      | 468897 |      502085 | 516830 |
+-----------+--------+-------------+--------+
| randread  |  15773 |       26632 |  21386 |
+-----------+--------+-------------+--------+
| write     |  58185 |      141272 | 141671 |
+-----------+--------+-------------+--------+
| randwrite |  59892 |       75236 |  76486 |
+-----------+--------+-------------+--------+

As long as this patch has the primary objective of improving bandwidth, another
set of tests has been performed to see how this behaves on a totally different
scenario that involves accessing many small files. For this purpose, measuring
the build time of the Linux kernel has been chosen as a well-known workload.
The kernel has been built with as many processes as the number of logical CPUs
(-j $(nproc)), that besides being a reasonable number, is also enough to
saturate the processor’s utilization thanks to the additional FUSE daemon’s
threads, making it even harder to get closer to the native filesystem
performance.
The following table shows the total build times in the different
configurations:

+------------------+--------------+-----------+
|                  | AVG duration |  Standard |
|                  |     (sec)    | deviation |
+------------------+--------------+-----------+
| FUSE             |      144.566 |     0.697 |
+------------------+--------------+-----------+
| FUSE passthrough |      133.820 |     0.341 |
+------------------+--------------+-----------+
| Raw              |      109.423 |     0.724 |
+------------------+--------------+-----------+

Further testing and performance evaluations are welcome.

Changes in v7:
* Full handling of aio requests as done in overlayfs (update commit message).
* s/fget_raw/fget.
* Open fails in case of passthrough errors, emitting warning messages.
  [Proposed by Jann Horn]
* Create new local kiocb, getting rid of the previously proposed ki_filp
  swapping.
  [Proposed by Jann Horn and Jens Axboe]
* Code polishing.

Changes in v6:
* Port to kernel v5.8:
  * fuse_file_{read,write}_iter() changed since the v5 of this patch was
    proposed.
* Simplify fuse_simple_request().
* Merge fuse_passthrough.h into fuse_i.h
* Refactor of passthrough.c:
  * Remove BUG_ON()s.
  * Simplified error checking and request arguments indexing.
  * Use call_{read,write}_iter() utility functions.
  * Remove get_file() and fputs() during read/write: handle the extra FUSE
    references to the lower file object when the fuse_file is created/deleted.
  [Proposed by Jann Horn]

Changes in v5:
* Fix the check when setting the passthrough file.
  [Found when testing by Mike Shal]

Changes in v3 and v4:
* Use the fs_stack_depth to prevent further stacking and a minor fix.
  [Proposed by Jann Horn]

Changes in v2:
* Changed the feature name to passthrough from stacked_io.
  [Proposed by Linus Torvalds]

---
 fs/fuse/Makefile          |   1 +
 fs/fuse/dev.c             |   4 +
 fs/fuse/dir.c             |   2 +
 fs/fuse/file.c            |  25 +++--
 fs/fuse/fuse_i.h          |  20 ++++
 fs/fuse/inode.c           |  22 +++-
 fs/fuse/passthrough.c     | 219 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fuse.h |   4 +-
 8 files changed, 286 insertions(+), 11 deletions(-)
 create mode 100644 fs/fuse/passthrough.c

diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3e8cebfb59b7..6971454a2bdf 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_CUSE) += cuse.o
 obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
 
 fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
+fuse-objs += passthrough.o
 virtiofs-y += virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 02b3c36b3676..a99c28fd4566 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -506,6 +506,7 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct fuse_args *args)
 		BUG_ON(args->out_numargs == 0);
 		ret = args->out_args[args->out_numargs - 1].size;
 	}
+	args->passthrough_filp = req->passthrough_filp;
 	fuse_put_request(fc, req);
 
 	return ret;
@@ -1897,6 +1898,9 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
 		err = copy_out_args(cs, req->args, nbytes);
 	fuse_copy_finish(cs);
 
+	if (fuse_passthrough_setup(fc, req))
+		err = -EINVAL;
+
 	spin_lock(&fpq->lock);
 	clear_bit(FR_LOCKED, &req->flags);
 	if (!fpq->connected)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 26f028bc760b..531de0c5c9e8 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -477,6 +477,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
 	args.out_args[0].value = &outentry;
 	args.out_args[1].size = sizeof(outopen);
 	args.out_args[1].value = &outopen;
+	args.passthrough_filp = NULL;
 	err = fuse_simple_request(fc, &args);
 	if (err)
 		goto out_free_ff;
@@ -489,6 +490,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
 	ff->fh = outopen.fh;
 	ff->nodeid = outentry.nodeid;
 	ff->open_flags = outopen.open_flags;
+	ff->passthrough_filp = args.passthrough_filp;
 	inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
 			  &outentry.attr, entry_attr_timeout(&outentry), 0);
 	if (!inode) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 83d917f7e542..c3289ff0cd33 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -33,10 +33,12 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
 }
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
-			  int opcode, struct fuse_open_out *outargp)
+			  int opcode, struct fuse_open_out *outargp,
+			  struct file **passthrough_filp)
 {
 	struct fuse_open_in inarg;
 	FUSE_ARGS(args);
+	int ret;
 
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
@@ -51,7 +53,10 @@ static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 	args.out_args[0].size = sizeof(*outargp);
 	args.out_args[0].value = outargp;
 
-	return fuse_simple_request(fc, &args);
+	ret = fuse_simple_request(fc, &args);
+	*passthrough_filp = args.passthrough_filp;
+
+	return ret;
 }
 
 struct fuse_release_args {
@@ -144,14 +149,16 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 	/* Default for no-open */
 	ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
 	if (isdir ? !fc->no_opendir : !fc->no_open) {
+		struct file *passthrough_filp;
 		struct fuse_open_out outarg;
 		int err;
 
-		err = fuse_send_open(fc, nodeid, file, opcode, &outarg);
+		err = fuse_send_open(fc, nodeid, file, opcode, &outarg,
+				     &passthrough_filp);
 		if (!err) {
 			ff->fh = outarg.fh;
 			ff->open_flags = outarg.open_flags;
-
+			ff->passthrough_filp = passthrough_filp;
 		} else if (err != -ENOSYS) {
 			fuse_file_free(ff);
 			return err;
@@ -281,6 +288,8 @@ void fuse_release_common(struct file *file, bool isdir)
 	struct fuse_release_args *ra = ff->release_args;
 	int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
 
+	fuse_passthrough_release(ff);
+
 	fuse_prepare_release(fi, ff, file->f_flags, opcode);
 
 	if (ff->flock) {
@@ -1543,7 +1552,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
+	if (ff->passthrough_filp)
+		return fuse_passthrough_read_iter(iocb, to);
+	else if (!(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_cache_read_iter(iocb, to);
 	else
 		return fuse_direct_read_iter(iocb, to);
@@ -1557,7 +1568,9 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
+	if (ff->passthrough_filp)
+		return fuse_passthrough_write_iter(iocb, from);
+	else if (!(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_cache_write_iter(iocb, from);
 	else
 		return fuse_direct_write_iter(iocb, from);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..4e2ef3c436d0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -208,6 +208,12 @@ struct fuse_file {
 
 	} readdir;
 
+	/**
+	 * Reference to lower filesystem file for read/write operations
+	 * handled in pass-through mode
+	 */
+	struct file *passthrough_filp;
+
 	/** RB node to be linked on fuse_conn->polled_files */
 	struct rb_node polled_node;
 
@@ -252,6 +258,7 @@ struct fuse_args {
 	bool may_block:1;
 	struct fuse_in_arg in_args[3];
 	struct fuse_arg out_args[2];
+	struct file *passthrough_filp;
 	void (*end)(struct fuse_conn *fc, struct fuse_args *args, int error);
 };
 
@@ -353,6 +360,9 @@ struct fuse_req {
 		struct fuse_out_header h;
 	} out;
 
+	/** Lower filesystem file pointer used in pass-through mode */
+	struct file *passthrough_filp;
+
 	/** Used to wake up the task waiting for completion of request*/
 	wait_queue_head_t waitq;
 
@@ -720,6 +730,9 @@ struct fuse_conn {
 	/* Do not show mount options */
 	unsigned int no_mount_options:1;
 
+	/** Pass-through mode for read/write IO */
+	unsigned int passthrough:1;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
@@ -1093,4 +1106,11 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
 void fuse_free_conn(struct fuse_conn *fc);
 
+int __init fuse_passthrough_aio_request_cache_init(void);
+void fuse_passthrough_aio_request_cache_destroy(void);
+int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_req *req);
+ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
+void fuse_passthrough_release(struct fuse_file *ff);
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..a8fc9e7fd82e 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -965,6 +965,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
 					min_t(unsigned int, FUSE_MAX_MAX_PAGES,
 					max_t(unsigned int, arg->max_pages, 1));
 			}
+			if (arg->flags & FUSE_PASSTHROUGH) {
+				fc->passthrough = 1;
+				/* Prevent further stacking */
+				fc->sb->s_stack_depth =
+					FILESYSTEM_MAX_STACK_DEPTH;
+			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1002,7 +1008,8 @@ void fuse_send_init(struct fuse_conn *fc)
 		FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
 		FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
 		FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
-		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+		FUSE_PASSTHROUGH;
 	ia->args.opcode = FUSE_INIT;
 	ia->args.in_numargs = 1;
 	ia->args.in_args[0].size = sizeof(ia->in);
@@ -1428,18 +1435,24 @@ static int __init fuse_fs_init(void)
 	if (!fuse_inode_cachep)
 		goto out;
 
-	err = register_fuseblk();
+	err = fuse_passthrough_aio_request_cache_init();
 	if (err)
 		goto out2;
 
-	err = register_filesystem(&fuse_fs_type);
+	err = register_fuseblk();
 	if (err)
 		goto out3;
 
+	err = register_filesystem(&fuse_fs_type);
+	if (err)
+		goto out4;
+
 	return 0;
 
- out3:
+ out4:
 	unregister_fuseblk();
+ out3:
+	fuse_passthrough_aio_request_cache_destroy();
  out2:
 	kmem_cache_destroy(fuse_inode_cachep);
  out:
@@ -1457,6 +1470,7 @@ static void fuse_fs_cleanup(void)
 	 */
 	rcu_barrier();
 	kmem_cache_destroy(fuse_inode_cachep);
+	fuse_passthrough_aio_request_cache_destroy();
 }
 
 static struct kobject *fuse_kobj;
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
new file mode 100644
index 000000000000..b5eee5acd2a2
--- /dev/null
+++ b/fs/fuse/passthrough.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "fuse_i.h"
+
+#include <linux/aio.h>
+#include <linux/fs_stack.h>
+#include <linux/uio.h>
+
+static struct kmem_cache *fuse_aio_request_cachep;
+
+struct fuse_aio_req {
+	struct kiocb iocb;
+	struct kiocb *iocb_fuse;
+};
+
+static void fuse_copyattr(struct file *dst_file, struct file *src_file,
+			  bool write)
+{
+	struct inode *dst = file_inode(dst_file);
+	struct inode *src = file_inode(src_file);
+
+	fsstack_copy_attr_times(dst, src);
+	if (write)
+		fsstack_copy_inode_size(dst, src);
+}
+
+static void fuse_aio_cleanup_handler(struct fuse_aio_req *aio_req)
+{
+	struct kiocb *iocb_fuse = aio_req->iocb_fuse;
+	struct kiocb *iocb = &aio_req->iocb;
+
+	if (iocb->ki_flags & IOCB_WRITE) {
+		__sb_writers_acquired(file_inode(iocb->ki_filp)->i_sb,
+				      SB_FREEZE_WRITE);
+		file_end_write(iocb->ki_filp);
+		fuse_copyattr(iocb_fuse->ki_filp, iocb->ki_filp, true);
+	}
+
+	iocb_fuse->ki_pos = iocb->ki_pos;
+	kmem_cache_free(fuse_aio_request_cachep, aio_req);
+}
+
+static void fuse_aio_rw_complete(struct kiocb *iocb, long res, long res2)
+{
+	struct fuse_aio_req *aio_req =
+		container_of(iocb, struct fuse_aio_req, iocb);
+	struct kiocb *iocb_fuse = aio_req->iocb_fuse;
+
+	fuse_aio_cleanup_handler(aio_req);
+	iocb_fuse->ki_complete(iocb_fuse, res, res2);
+}
+
+ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
+				   struct iov_iter *iter)
+{
+	struct file *fuse_filp = iocb_fuse->ki_filp;
+	struct fuse_file *ff = fuse_filp->private_data;
+	struct file *passthrough_filp = ff->passthrough_filp;
+	ssize_t ret;
+
+	if (!iov_iter_count(iter))
+		return 0;
+
+	if (is_sync_kiocb(iocb_fuse)) {
+		struct kiocb iocb;
+
+		kiocb_clone(&iocb, iocb_fuse, passthrough_filp);
+		ret = call_read_iter(passthrough_filp, &iocb, iter);
+		iocb_fuse->ki_pos = iocb.ki_pos;
+	} else {
+		struct fuse_aio_req *aio_req;
+
+		aio_req =
+			kmem_cache_zalloc(fuse_aio_request_cachep, GFP_KERNEL);
+		if (!aio_req)
+			return -ENOMEM;
+
+		aio_req->iocb_fuse = iocb_fuse;
+		kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
+		aio_req->iocb.ki_complete = fuse_aio_rw_complete;
+		ret = vfs_iocb_iter_read(passthrough_filp, &aio_req->iocb,
+					 iter);
+		if (ret != -EIOCBQUEUED)
+			fuse_aio_cleanup_handler(aio_req);
+	}
+
+	fuse_copyattr(fuse_filp, passthrough_filp, false);
+
+	return ret;
+}
+
+ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
+				    struct iov_iter *iter)
+{
+	struct file *fuse_filp = iocb_fuse->ki_filp;
+	struct fuse_file *ff = fuse_filp->private_data;
+	struct file *passthrough_filp = ff->passthrough_filp;
+	struct inode *passthrough_inode = file_inode(passthrough_filp);
+	struct inode *fuse_inode = file_inode(fuse_filp);
+	ssize_t ret;
+
+	if (!iov_iter_count(iter))
+		return 0;
+
+	inode_lock(fuse_inode);
+
+	if (is_sync_kiocb(iocb_fuse)) {
+		struct kiocb iocb;
+
+		kiocb_clone(&iocb, iocb_fuse, passthrough_filp);
+
+		file_start_write(passthrough_filp);
+		ret = call_write_iter(passthrough_filp, &iocb, iter);
+		file_end_write(passthrough_filp);
+
+		iocb_fuse->ki_pos = iocb.ki_pos;
+		fuse_copyattr(fuse_filp, passthrough_filp, true);
+	} else {
+		struct fuse_aio_req *aio_req;
+
+		aio_req =
+			kmem_cache_zalloc(fuse_aio_request_cachep, GFP_KERNEL);
+		if (!aio_req)
+			return -ENOMEM;
+
+		file_start_write(passthrough_filp);
+		__sb_writers_release(passthrough_inode->i_sb, SB_FREEZE_WRITE);
+
+		aio_req->iocb_fuse = iocb_fuse;
+		kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
+		aio_req->iocb.ki_complete = fuse_aio_rw_complete;
+		ret = vfs_iocb_iter_write(passthrough_filp, &aio_req->iocb,
+					  iter);
+		if (ret != -EIOCBQUEUED)
+			fuse_aio_cleanup_handler(aio_req);
+	}
+
+	inode_unlock(fuse_inode);
+
+	return ret;
+}
+
+int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_req *req)
+{
+	struct super_block *passthrough_sb;
+	struct inode *passthrough_inode;
+	struct fuse_open_out *open_out;
+	struct file *passthrough_filp;
+	unsigned short open_out_index;
+	int fs_stack_depth;
+
+	req->passthrough_filp = NULL;
+
+	if (!fc->passthrough)
+		return 0;
+
+	if (!(req->in.h.opcode == FUSE_OPEN && req->args->out_numargs == 1) &&
+	    !(req->in.h.opcode == FUSE_CREATE && req->args->out_numargs == 2))
+		return 0;
+
+	open_out_index = req->args->out_numargs - 1;
+
+	if (req->args->out_args[open_out_index].size != sizeof(*open_out))
+		return 0;
+
+	open_out = req->args->out_args[open_out_index].value;
+
+	if (!(open_out->open_flags & FOPEN_PASSTHROUGH))
+		return 0;
+
+	passthrough_filp = fget(open_out->fd);
+	if (!passthrough_filp) {
+		pr_err("FUSE: invalid file descriptor for passthrough.\n");
+		return -1;
+	}
+
+	if (!passthrough_filp->f_op->read_iter ||
+	    !passthrough_filp->f_op->write_iter) {
+		pr_err("FUSE: passthrough file misses file operations.\n");
+		fput(passthrough_filp);
+		return -1;
+	}
+
+	passthrough_inode = file_inode(passthrough_filp);
+	passthrough_sb = passthrough_inode->i_sb;
+	fs_stack_depth = passthrough_sb->s_stack_depth + 1;
+	if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+		pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
+		fput(passthrough_filp);
+		return -1;
+	}
+
+	req->passthrough_filp = passthrough_filp;
+	return 0;
+}
+
+void fuse_passthrough_release(struct fuse_file *ff)
+{
+	if (ff->passthrough_filp) {
+		fput(ff->passthrough_filp);
+		ff->passthrough_filp = NULL;
+	}
+}
+
+int __init fuse_passthrough_aio_request_cache_init(void)
+{
+	fuse_aio_request_cachep =
+		kmem_cache_create("fuse_aio_req", sizeof(struct fuse_aio_req),
+				  0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!fuse_aio_request_cachep)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void fuse_passthrough_aio_request_cache_destroy(void)
+{
+	kmem_cache_destroy(fuse_aio_request_cachep);
+}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..e50bd775210a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -283,6 +283,7 @@ struct fuse_file_lock {
 #define FOPEN_NONSEEKABLE	(1 << 2)
 #define FOPEN_CACHE_DIR		(1 << 3)
 #define FOPEN_STREAM		(1 << 4)
+#define FOPEN_PASSTHROUGH	(1 << 5)
 
 /**
  * INIT request/reply flags
@@ -342,6 +343,7 @@ struct fuse_file_lock {
 #define FUSE_NO_OPENDIR_SUPPORT (1 << 24)
 #define FUSE_EXPLICIT_INVAL_DATA (1 << 25)
 #define FUSE_MAP_ALIGNMENT	(1 << 26)
+#define FUSE_PASSTHROUGH	(1 << 27)
 
 /**
  * CUSE INIT request/reply flags
@@ -591,7 +593,7 @@ struct fuse_create_in {
 struct fuse_open_out {
 	uint64_t	fh;
 	uint32_t	open_flags;
-	uint32_t	padding;
+	int32_t		fd;
 };
 
 struct fuse_release_in {
-- 
2.28.0.220.ged08abb693-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-18 13:53       ` Alessio Balsini
@ 2020-08-19  6:01         ` Nikolaus Rath
  2020-08-19  9:24           ` Amir Goldstein
  2020-08-19 10:04         ` Jann Horn
  1 sibling, 1 reply; 13+ messages in thread
From: Nikolaus Rath @ 2020-08-19  6:01 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Jann Horn, Jens Axboe, Miklos Szeredi, Nikhilesh Reddy,
	Akilesh Kailash, David Anderson, Eric Yan, Martijn Coenen,
	Palmer Dabbelt, Paul Lawrence, Stefano Duo, Zimuzo Ezeozue,
	kernel-team, linux-fsdevel, kernel list

Hi Alessio,

Thank you for working on this, I'm excited to see things moving again on
this front!

What I would really like to see in the long-term is the ability for FUSE
to support passthrough for specific areas of a file, i.e. the ability to
specify different passthrough fds for different regions of the FUSE
file.

I do not want to ask you to extend the patch, but I am a little worried
that the proposed interface would make a future extension more
complicated than necessary. Could you comment on that? Is there a way to
design the libfuse interface in a way that would - in the future - make
it easier to extend it along these lines?

What I have in mind is things like not coupling the setup of the
passthrough fds to open(), but having a separate notification message for
this (like what we use for invalidation of cache), and adding not just
an "fd" field but also "offset" and "length" fields (which would
currently be required to be both zero to get the "full file" semantics).

What do you think?

Best,
-Nikolaus

On Aug 18 2020, Alessio Balsini <balsini@android.com> wrote:
> Thank you both for the important feedback,
>
> I tried to consolidate all your suggestions in the new version of the
> patch, shared below.
>
> As you both recommended, that tricky ki_filp swapping has been removed,
> taking overlayfs as a reference for the management of asynchronous
> requests.
> The V7 below went again through some testing with fio (sync and libaio
> engines with several jobs in randrw), fsstress, and a bunch of kernel
> builds to trigger both the sync and async paths. Gladly everything worked
> as expected.
>
> What I didn't get from overlayfs are the temporary cred switchings, as I
> think that in the FUSE use case it is still the FUSE daemon that should
> take care of identifying if the requesting process has the right
> permissions before allowing the passthrough feature.
>
> On Thu, Aug 13, 2020 at 08:30:21PM +0200, 'Jann Horn' via kernel-team wrote:
>> On Thu, Aug 13, 2020 at 3:28 PM Alessio Balsini <balsini@android.com> wrote:
>> [...]
>> 
>> My point is that you can use this security issue to steal file
>> descriptors from processes that are _not_ supposed to act as FUSE
>> daemons. For example, let's say there is some setuid root executable
>> that opens /etc/shadow and then writes a message to stdout. If I
>> invoke this setuid root executable with stdout pointing to a FUSE
>> device fd, and I can arrange for the message written by the setuid
>> process to look like a FUSE message, I can trick the setuid process to
>> install its /etc/shadow fd as a passthrough fd on a FUSE file, and
>> then, through the FUSE filesystem, mess with /etc/shadow.
>> 
>> Also, for context: While on Android, access to /dev/fuse is
>> restricted, desktop Linux systems allow unprivileged users to use
>> FUSE.
>> [...]
>> The new fuse_dev_ioctl() command would behave just like
>> fuse_dev_write(), except that the ioctl-based interface would permit
>> OPEN/CREATE replies with FOPEN_PASSTHROUGH, while the write()-based
>> interface would reject them.
>> [...]
>> I can't think of any checks that you could do there to make it safe,
>> because fundamentally what you have to figure out is _userspace's
>> intent_, and you can't figure out what that is if userspace just calls
>> write().
>> [...]
>> In this case, an error indicates that the userspace programmer made a
>> mistake. So even if the userspace programmer is not looking at kernel
>> logs, we should indicate to them that they messed up - and we can do
>> that by returning an error code from the syscall. So I think we should
>> ideally abort the operation in this case.
>
> I've been thinking about the security issue you mentioned, but I'm thinking
> that besides the extra Android restrictions, it shouldn't be that easy to
> play with /dev/fuse, also for a privileged process in Linux.
> In order for a process to successfully send commands to /dev/fuse, the
> command has to match a pending request coming from the FUSE filesystem,
> meaning that the same privileged process should have mounted the FUSE
> filesystem, actually becoming a FUSE daemon itself.
> Is this still an eventually exploitable scenario?
>
>> 
>> No, call_write_iter() doesn't do any of the required checking and
>> setup and notification steps, it just calls into the file's
>> ->write_iter handler:
>> 
>> static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
>>       struct iov_iter *iter)
>> {
>>         return file->f_op->write_iter(kio, iter);
>> }
>> 
>> Did you confuse call_write_iter() with vfs_iocb_iter_write(), or
>> something like that?
>
> Ops, you are right, I was referring to vfs_iocb_iter_write(), that was
> actually part of my first design, but ended up simplifying in favor of
> call_write_iter().
>
>> Requests can complete asynchronously. That means call_write_iter() can
>> return more or less immediately, while the request is processed in the
>> background, and at some point, a callback is invoked. That's what that
>> -EIOCBQUEUED return value is about. In that case, the current code
>> will change ->ki_filp while the request is still being handled in the
>> background.
>> 
>> I recommend looking at ovl_write_iter() in overlayfs to see how
>> they're dealing with this case.
>
> That was a great suggestion, I hopefully picked the right things from
> overlayfs, without overlooking some security aspects.
>
> Thanks again,
> Alessio
>
> ---8<---
>
> From 5f0e162d2bb39acf41687ca722e15e95d0721c43 Mon Sep 17 00:00:00 2001
> From: Alessio Balsini <balsini@android.com>
> Date: Wed, 12 Aug 2020 17:14:52 +0100
> Subject: [PATCH v7] fuse: Add support for passthrough read/write
>
> Add support for filesystem passthrough read/write of files when enabled
> in userspace through the option FUSE_PASSTHROUGH.
>
> There are filesystems based on FUSE that are intended to enforce special
> policies or trigger complicate decision makings at the file operations
> level.  Android, for example, uses FUSE to enforce fine-grained access
> policies that also depend on the file contents.
> Sometimes it happens that at open or create time a file is identified as
> not requiring additional checks for consequent reads/writes, thus FUSE
> would simply act as a passive bridge between the process accessing the
> FUSE filesystem and the lower filesystem. Splicing and caching help
> reducing the FUSE overhead, but there are still read/write operations
> forwarded to the userspace FUSE daemon that could be avoided.
>
> When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may
> decide while handling the open or create operations, if the given file
> can be accessed in passthrough mode, meaning that all the further read
> and write operations would be forwarded by the kernel directly to the
> lower filesystem rather than to the FUSE daemon. All requests that are
> not reads or writes are still handled by the userspace code.
> This allows for improved performance on reads and writes, especially in
> the case of reads at random offsets, for which no (readahead) caching
> mechanism would help.
> Benchmarks show improved performance that is close to native filesystem
> access when doing massive manipulations on a single opened file,
> especially in the case of random reads, for which the bandwidth
> increased by almost 2X or sequential writes for which the improvement is
> close to 3X.
>
> The creation of this direct connection (passthrough) between FUSE file
> objects and file objects in the lower filesystem happens in a way that
> reminds of passing file descriptors via sockets:
> - a process opens a file handled by FUSE, so the kernel forwards the
>   request to the FUSE daemon;
> - the FUSE daemon opens the target file in the lower filesystem, getting
>   its file descriptor;
> - the file descriptor is passed to the kernel via /dev/fuse;
> - the kernel gets the file pointer navigating through the opened files
>   of the "current" process and stores it in an additional field in the
>   fuse_file owned by the process accessing the FUSE filesystem.
> From now all the read/write operations performed by that process will be
> redirected to the file pointer pointing at the lower filesystem's file.
> Handling asynchronous IO is done by creating separate AIO requests for
> the lower filesystem that will be internally tracked by FUSE, that
> intercepts and propagates their completion through an internal
> ki_completed callback similar to the current implementation of
> overlayfs.
>
> Original-patch-by: Nikhilesh Reddy <reddyn@codeaurora.org>
> Signed-off-by: Alessio Balsini <balsini@android.com>
> --
>     Performance
>
> What follows has been performed with this change [V6] rebased on top of a
> vanilla v5.8 Linux kernel, using a custom passthrough_hp FUSE daemon that
> enables pass-through for each file that is opened during both “open” and
> “create”. Tests were run on an Intel Xeon E5-2678V3, 32GiB of RAM, with an
> ext4-formatted SSD as the lower filesystem, with no special tuning, e.g., all
> the involved processes are SCHED_OTHER, ondemand is the frequency governor with
> no frequency restrictions, and turbo-boost, as well as p-state, are active.
> This is because I noticed that, for such high-level benchmarks, results
> consistency was minimally affected by these features.
> The source code of the updated libfuse library and passthrough_hp is shared at
> the following repository:
>
>     https://github.com/balsini/libfuse/tree/fuse-passthrough-stable-v.3.9.4
>
> Two different kinds of benchmarks were done for this change, the first set of
> tests evaluates the bandwidth improvements when manipulating a huge single
> file, the second set of tests verify that no performance regressions were
> introduced when handling many small files.
>
> The first benchmarks were done by running FIO (fio-3.21) with:
> - bs=4Ki;
> - file size: 50Gi;
> - ioengine: sync;
> - fsync_on_close: true.
> The target file has been chosen large enough to avoid it to be entirely loaded
> into the page cache.
> Results are presented in the following table:
>
> +-----------+--------+-------------+--------+
> | Bandwidth |  FUSE  |     FUSE    |  Bind  |
> |  (KiB/s)  |        | passthrough |  mount |
> +-----------+--------+-------------+--------+
> | read      | 468897 |      502085 | 516830 |
> +-----------+--------+-------------+--------+
> | randread  |  15773 |       26632 |  21386 |
> +-----------+--------+-------------+--------+
> | write     |  58185 |      141272 | 141671 |
> +-----------+--------+-------------+--------+
> | randwrite |  59892 |       75236 |  76486 |
> +-----------+--------+-------------+--------+
>
> As long as this patch has the primary objective of improving bandwidth, another
> set of tests has been performed to see how this behaves on a totally different
> scenario that involves accessing many small files. For this purpose, measuring
> the build time of the Linux kernel has been chosen as a well-known workload.
> The kernel has been built with as many processes as the number of logical CPUs
> (-j $(nproc)), that besides being a reasonable number, is also enough to
> saturate the processor’s utilization thanks to the additional FUSE daemon’s
> threads, making it even harder to get closer to the native filesystem
> performance.
> The following table shows the total build times in the different
> configurations:
>
> +------------------+--------------+-----------+
> |                  | AVG duration |  Standard |
> |                  |     (sec)    | deviation |
> +------------------+--------------+-----------+
> | FUSE             |      144.566 |     0.697 |
> +------------------+--------------+-----------+
> | FUSE passthrough |      133.820 |     0.341 |
> +------------------+--------------+-----------+
> | Raw              |      109.423 |     0.724 |
> +------------------+--------------+-----------+
>
> Further testing and performance evaluations are welcome.
>
> Changes in v7:
> * Full handling of aio requests as done in overlayfs (update commit message).
> * s/fget_raw/fget.
> * Open fails in case of passthrough errors, emitting warning messages.
>   [Proposed by Jann Horn]
> * Create new local kiocb, getting rid of the previously proposed ki_filp
>   swapping.
>   [Proposed by Jann Horn and Jens Axboe]
> * Code polishing.
>
> Changes in v6:
> * Port to kernel v5.8:
>   * fuse_file_{read,write}_iter() changed since the v5 of this patch was
>     proposed.
> * Simplify fuse_simple_request().
> * Merge fuse_passthrough.h into fuse_i.h
> * Refactor of passthrough.c:
>   * Remove BUG_ON()s.
>   * Simplified error checking and request arguments indexing.
>   * Use call_{read,write}_iter() utility functions.
>   * Remove get_file() and fputs() during read/write: handle the extra FUSE
>     references to the lower file object when the fuse_file is created/deleted.
>   [Proposed by Jann Horn]
>
> Changes in v5:
> * Fix the check when setting the passthrough file.
>   [Found when testing by Mike Shal]
>
> Changes in v3 and v4:
> * Use the fs_stack_depth to prevent further stacking and a minor fix.
>   [Proposed by Jann Horn]
>
> Changes in v2:
> * Changed the feature name to passthrough from stacked_io.
>   [Proposed by Linus Torvalds]
>
> ---
>  fs/fuse/Makefile          |   1 +
>  fs/fuse/dev.c             |   4 +
>  fs/fuse/dir.c             |   2 +
>  fs/fuse/file.c            |  25 +++--
>  fs/fuse/fuse_i.h          |  20 ++++
>  fs/fuse/inode.c           |  22 +++-
>  fs/fuse/passthrough.c     | 219 ++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fuse.h |   4 +-
>  8 files changed, 286 insertions(+), 11 deletions(-)
>  create mode 100644 fs/fuse/passthrough.c
>
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 3e8cebfb59b7..6971454a2bdf 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -8,4 +8,5 @@ obj-$(CONFIG_CUSE) += cuse.o
>  obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
>  
>  fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
> +fuse-objs += passthrough.o
>  virtiofs-y += virtio_fs.o
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 02b3c36b3676..a99c28fd4566 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -506,6 +506,7 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct fuse_args *args)
>  		BUG_ON(args->out_numargs == 0);
>  		ret = args->out_args[args->out_numargs - 1].size;
>  	}
> +	args->passthrough_filp = req->passthrough_filp;
>  	fuse_put_request(fc, req);
>  
>  	return ret;
> @@ -1897,6 +1898,9 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
>  		err = copy_out_args(cs, req->args, nbytes);
>  	fuse_copy_finish(cs);
>  
> +	if (fuse_passthrough_setup(fc, req))
> +		err = -EINVAL;
> +
>  	spin_lock(&fpq->lock);
>  	clear_bit(FR_LOCKED, &req->flags);
>  	if (!fpq->connected)
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 26f028bc760b..531de0c5c9e8 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -477,6 +477,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
>  	args.out_args[0].value = &outentry;
>  	args.out_args[1].size = sizeof(outopen);
>  	args.out_args[1].value = &outopen;
> +	args.passthrough_filp = NULL;
>  	err = fuse_simple_request(fc, &args);
>  	if (err)
>  		goto out_free_ff;
> @@ -489,6 +490,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
>  	ff->fh = outopen.fh;
>  	ff->nodeid = outentry.nodeid;
>  	ff->open_flags = outopen.open_flags;
> +	ff->passthrough_filp = args.passthrough_filp;
>  	inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
>  			  &outentry.attr, entry_attr_timeout(&outentry), 0);
>  	if (!inode) {
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 83d917f7e542..c3289ff0cd33 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -33,10 +33,12 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
>  }
>  
>  static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
> -			  int opcode, struct fuse_open_out *outargp)
> +			  int opcode, struct fuse_open_out *outargp,
> +			  struct file **passthrough_filp)
>  {
>  	struct fuse_open_in inarg;
>  	FUSE_ARGS(args);
> +	int ret;
>  
>  	memset(&inarg, 0, sizeof(inarg));
>  	inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
> @@ -51,7 +53,10 @@ static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
>  	args.out_args[0].size = sizeof(*outargp);
>  	args.out_args[0].value = outargp;
>  
> -	return fuse_simple_request(fc, &args);
> +	ret = fuse_simple_request(fc, &args);
> +	*passthrough_filp = args.passthrough_filp;
> +
> +	return ret;
>  }
>  
>  struct fuse_release_args {
> @@ -144,14 +149,16 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
>  	/* Default for no-open */
>  	ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
>  	if (isdir ? !fc->no_opendir : !fc->no_open) {
> +		struct file *passthrough_filp;
>  		struct fuse_open_out outarg;
>  		int err;
>  
> -		err = fuse_send_open(fc, nodeid, file, opcode, &outarg);
> +		err = fuse_send_open(fc, nodeid, file, opcode, &outarg,
> +				     &passthrough_filp);
>  		if (!err) {
>  			ff->fh = outarg.fh;
>  			ff->open_flags = outarg.open_flags;
> -
> +			ff->passthrough_filp = passthrough_filp;
>  		} else if (err != -ENOSYS) {
>  			fuse_file_free(ff);
>  			return err;
> @@ -281,6 +288,8 @@ void fuse_release_common(struct file *file, bool isdir)
>  	struct fuse_release_args *ra = ff->release_args;
>  	int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
>  
> +	fuse_passthrough_release(ff);
> +
>  	fuse_prepare_release(fi, ff, file->f_flags, opcode);
>  
>  	if (ff->flock) {
> @@ -1543,7 +1552,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	if (is_bad_inode(file_inode(file)))
>  		return -EIO;
>  
> -	if (!(ff->open_flags & FOPEN_DIRECT_IO))
> +	if (ff->passthrough_filp)
> +		return fuse_passthrough_read_iter(iocb, to);
> +	else if (!(ff->open_flags & FOPEN_DIRECT_IO))
>  		return fuse_cache_read_iter(iocb, to);
>  	else
>  		return fuse_direct_read_iter(iocb, to);
> @@ -1557,7 +1568,9 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (is_bad_inode(file_inode(file)))
>  		return -EIO;
>  
> -	if (!(ff->open_flags & FOPEN_DIRECT_IO))
> +	if (ff->passthrough_filp)
> +		return fuse_passthrough_write_iter(iocb, from);
> +	else if (!(ff->open_flags & FOPEN_DIRECT_IO))
>  		return fuse_cache_write_iter(iocb, from);
>  	else
>  		return fuse_direct_write_iter(iocb, from);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 740a8a7d7ae6..4e2ef3c436d0 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -208,6 +208,12 @@ struct fuse_file {
>  
>  	} readdir;
>  
> +	/**
> +	 * Reference to lower filesystem file for read/write operations
> +	 * handled in pass-through mode
> +	 */
> +	struct file *passthrough_filp;
> +
>  	/** RB node to be linked on fuse_conn->polled_files */
>  	struct rb_node polled_node;
>  
> @@ -252,6 +258,7 @@ struct fuse_args {
>  	bool may_block:1;
>  	struct fuse_in_arg in_args[3];
>  	struct fuse_arg out_args[2];
> +	struct file *passthrough_filp;
>  	void (*end)(struct fuse_conn *fc, struct fuse_args *args, int error);
>  };
>  
> @@ -353,6 +360,9 @@ struct fuse_req {
>  		struct fuse_out_header h;
>  	} out;
>  
> +	/** Lower filesystem file pointer used in pass-through mode */
> +	struct file *passthrough_filp;
> +
>  	/** Used to wake up the task waiting for completion of request*/
>  	wait_queue_head_t waitq;
>  
> @@ -720,6 +730,9 @@ struct fuse_conn {
>  	/* Do not show mount options */
>  	unsigned int no_mount_options:1;
>  
> +	/** Pass-through mode for read/write IO */
> +	unsigned int passthrough:1;
> +
>  	/** The number of requests waiting for completion */
>  	atomic_t num_waiting;
>  
> @@ -1093,4 +1106,11 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
>  u64 fuse_get_unique(struct fuse_iqueue *fiq);
>  void fuse_free_conn(struct fuse_conn *fc);
>  
> +int __init fuse_passthrough_aio_request_cache_init(void);
> +void fuse_passthrough_aio_request_cache_destroy(void);
> +int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_req *req);
> +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
> +ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
> +void fuse_passthrough_release(struct fuse_file *ff);
> +
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index bba747520e9b..a8fc9e7fd82e 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -965,6 +965,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
>  					min_t(unsigned int, FUSE_MAX_MAX_PAGES,
>  					max_t(unsigned int, arg->max_pages, 1));
>  			}
> +			if (arg->flags & FUSE_PASSTHROUGH) {
> +				fc->passthrough = 1;
> +				/* Prevent further stacking */
> +				fc->sb->s_stack_depth =
> +					FILESYSTEM_MAX_STACK_DEPTH;
> +			}
>  		} else {
>  			ra_pages = fc->max_read / PAGE_SIZE;
>  			fc->no_lock = 1;
> @@ -1002,7 +1008,8 @@ void fuse_send_init(struct fuse_conn *fc)
>  		FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
>  		FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
>  		FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> -		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
> +		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> +		FUSE_PASSTHROUGH;
>  	ia->args.opcode = FUSE_INIT;
>  	ia->args.in_numargs = 1;
>  	ia->args.in_args[0].size = sizeof(ia->in);
> @@ -1428,18 +1435,24 @@ static int __init fuse_fs_init(void)
>  	if (!fuse_inode_cachep)
>  		goto out;
>  
> -	err = register_fuseblk();
> +	err = fuse_passthrough_aio_request_cache_init();
>  	if (err)
>  		goto out2;
>  
> -	err = register_filesystem(&fuse_fs_type);
> +	err = register_fuseblk();
>  	if (err)
>  		goto out3;
>  
> +	err = register_filesystem(&fuse_fs_type);
> +	if (err)
> +		goto out4;
> +
>  	return 0;
>  
> - out3:
> + out4:
>  	unregister_fuseblk();
> + out3:
> +	fuse_passthrough_aio_request_cache_destroy();
>   out2:
>  	kmem_cache_destroy(fuse_inode_cachep);
>   out:
> @@ -1457,6 +1470,7 @@ static void fuse_fs_cleanup(void)
>  	 */
>  	rcu_barrier();
>  	kmem_cache_destroy(fuse_inode_cachep);
> +	fuse_passthrough_aio_request_cache_destroy();
>  }
>  
>  static struct kobject *fuse_kobj;
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> new file mode 100644
> index 000000000000..b5eee5acd2a2
> --- /dev/null
> +++ b/fs/fuse/passthrough.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "fuse_i.h"
> +
> +#include <linux/aio.h>
> +#include <linux/fs_stack.h>
> +#include <linux/uio.h>
> +
> +static struct kmem_cache *fuse_aio_request_cachep;
> +
> +struct fuse_aio_req {
> +	struct kiocb iocb;
> +	struct kiocb *iocb_fuse;
> +};
> +
> +static void fuse_copyattr(struct file *dst_file, struct file *src_file,
> +			  bool write)
> +{
> +	struct inode *dst = file_inode(dst_file);
> +	struct inode *src = file_inode(src_file);
> +
> +	fsstack_copy_attr_times(dst, src);
> +	if (write)
> +		fsstack_copy_inode_size(dst, src);
> +}
> +
> +static void fuse_aio_cleanup_handler(struct fuse_aio_req *aio_req)
> +{
> +	struct kiocb *iocb_fuse = aio_req->iocb_fuse;
> +	struct kiocb *iocb = &aio_req->iocb;
> +
> +	if (iocb->ki_flags & IOCB_WRITE) {
> +		__sb_writers_acquired(file_inode(iocb->ki_filp)->i_sb,
> +				      SB_FREEZE_WRITE);
> +		file_end_write(iocb->ki_filp);
> +		fuse_copyattr(iocb_fuse->ki_filp, iocb->ki_filp, true);
> +	}
> +
> +	iocb_fuse->ki_pos = iocb->ki_pos;
> +	kmem_cache_free(fuse_aio_request_cachep, aio_req);
> +}
> +
> +static void fuse_aio_rw_complete(struct kiocb *iocb, long res, long res2)
> +{
> +	struct fuse_aio_req *aio_req =
> +		container_of(iocb, struct fuse_aio_req, iocb);
> +	struct kiocb *iocb_fuse = aio_req->iocb_fuse;
> +
> +	fuse_aio_cleanup_handler(aio_req);
> +	iocb_fuse->ki_complete(iocb_fuse, res, res2);
> +}
> +
> +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> +				   struct iov_iter *iter)
> +{
> +	struct file *fuse_filp = iocb_fuse->ki_filp;
> +	struct fuse_file *ff = fuse_filp->private_data;
> +	struct file *passthrough_filp = ff->passthrough_filp;
> +	ssize_t ret;
> +
> +	if (!iov_iter_count(iter))
> +		return 0;
> +
> +	if (is_sync_kiocb(iocb_fuse)) {
> +		struct kiocb iocb;
> +
> +		kiocb_clone(&iocb, iocb_fuse, passthrough_filp);
> +		ret = call_read_iter(passthrough_filp, &iocb, iter);
> +		iocb_fuse->ki_pos = iocb.ki_pos;
> +	} else {
> +		struct fuse_aio_req *aio_req;
> +
> +		aio_req =
> +			kmem_cache_zalloc(fuse_aio_request_cachep, GFP_KERNEL);
> +		if (!aio_req)
> +			return -ENOMEM;
> +
> +		aio_req->iocb_fuse = iocb_fuse;
> +		kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
> +		aio_req->iocb.ki_complete = fuse_aio_rw_complete;
> +		ret = vfs_iocb_iter_read(passthrough_filp, &aio_req->iocb,
> +					 iter);
> +		if (ret != -EIOCBQUEUED)
> +			fuse_aio_cleanup_handler(aio_req);
> +	}
> +
> +	fuse_copyattr(fuse_filp, passthrough_filp, false);
> +
> +	return ret;
> +}
> +
> +ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> +				    struct iov_iter *iter)
> +{
> +	struct file *fuse_filp = iocb_fuse->ki_filp;
> +	struct fuse_file *ff = fuse_filp->private_data;
> +	struct file *passthrough_filp = ff->passthrough_filp;
> +	struct inode *passthrough_inode = file_inode(passthrough_filp);
> +	struct inode *fuse_inode = file_inode(fuse_filp);
> +	ssize_t ret;
> +
> +	if (!iov_iter_count(iter))
> +		return 0;
> +
> +	inode_lock(fuse_inode);
> +
> +	if (is_sync_kiocb(iocb_fuse)) {
> +		struct kiocb iocb;
> +
> +		kiocb_clone(&iocb, iocb_fuse, passthrough_filp);
> +
> +		file_start_write(passthrough_filp);
> +		ret = call_write_iter(passthrough_filp, &iocb, iter);
> +		file_end_write(passthrough_filp);
> +
> +		iocb_fuse->ki_pos = iocb.ki_pos;
> +		fuse_copyattr(fuse_filp, passthrough_filp, true);
> +	} else {
> +		struct fuse_aio_req *aio_req;
> +
> +		aio_req =
> +			kmem_cache_zalloc(fuse_aio_request_cachep, GFP_KERNEL);
> +		if (!aio_req)
> +			return -ENOMEM;
> +
> +		file_start_write(passthrough_filp);
> +		__sb_writers_release(passthrough_inode->i_sb, SB_FREEZE_WRITE);
> +
> +		aio_req->iocb_fuse = iocb_fuse;
> +		kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
> +		aio_req->iocb.ki_complete = fuse_aio_rw_complete;
> +		ret = vfs_iocb_iter_write(passthrough_filp, &aio_req->iocb,
> +					  iter);
> +		if (ret != -EIOCBQUEUED)
> +			fuse_aio_cleanup_handler(aio_req);
> +	}
> +
> +	inode_unlock(fuse_inode);
> +
> +	return ret;
> +}
> +
> +int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_req *req)
> +{
> +	struct super_block *passthrough_sb;
> +	struct inode *passthrough_inode;
> +	struct fuse_open_out *open_out;
> +	struct file *passthrough_filp;
> +	unsigned short open_out_index;
> +	int fs_stack_depth;
> +
> +	req->passthrough_filp = NULL;
> +
> +	if (!fc->passthrough)
> +		return 0;
> +
> +	if (!(req->in.h.opcode == FUSE_OPEN && req->args->out_numargs == 1) &&
> +	    !(req->in.h.opcode == FUSE_CREATE && req->args->out_numargs == 2))
> +		return 0;
> +
> +	open_out_index = req->args->out_numargs - 1;
> +
> +	if (req->args->out_args[open_out_index].size != sizeof(*open_out))
> +		return 0;
> +
> +	open_out = req->args->out_args[open_out_index].value;
> +
> +	if (!(open_out->open_flags & FOPEN_PASSTHROUGH))
> +		return 0;
> +
> +	passthrough_filp = fget(open_out->fd);
> +	if (!passthrough_filp) {
> +		pr_err("FUSE: invalid file descriptor for passthrough.\n");
> +		return -1;
> +	}
> +
> +	if (!passthrough_filp->f_op->read_iter ||
> +	    !passthrough_filp->f_op->write_iter) {
> +		pr_err("FUSE: passthrough file misses file operations.\n");
> +		fput(passthrough_filp);
> +		return -1;
> +	}
> +
> +	passthrough_inode = file_inode(passthrough_filp);
> +	passthrough_sb = passthrough_inode->i_sb;
> +	fs_stack_depth = passthrough_sb->s_stack_depth + 1;
> +	if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> +		pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> +		fput(passthrough_filp);
> +		return -1;
> +	}
> +
> +	req->passthrough_filp = passthrough_filp;
> +	return 0;
> +}
> +
> +void fuse_passthrough_release(struct fuse_file *ff)
> +{
> +	if (ff->passthrough_filp) {
> +		fput(ff->passthrough_filp);
> +		ff->passthrough_filp = NULL;
> +	}
> +}
> +
> +int __init fuse_passthrough_aio_request_cache_init(void)
> +{
> +	fuse_aio_request_cachep =
> +		kmem_cache_create("fuse_aio_req", sizeof(struct fuse_aio_req),
> +				  0, SLAB_HWCACHE_ALIGN, NULL);
> +	if (!fuse_aio_request_cachep)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void fuse_passthrough_aio_request_cache_destroy(void)
> +{
> +	kmem_cache_destroy(fuse_aio_request_cachep);
> +}
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 373cada89815..e50bd775210a 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -283,6 +283,7 @@ struct fuse_file_lock {
>  #define FOPEN_NONSEEKABLE	(1 << 2)
>  #define FOPEN_CACHE_DIR		(1 << 3)
>  #define FOPEN_STREAM		(1 << 4)
> +#define FOPEN_PASSTHROUGH	(1 << 5)
>  
>  /**
>   * INIT request/reply flags
> @@ -342,6 +343,7 @@ struct fuse_file_lock {
>  #define FUSE_NO_OPENDIR_SUPPORT (1 << 24)
>  #define FUSE_EXPLICIT_INVAL_DATA (1 << 25)
>  #define FUSE_MAP_ALIGNMENT	(1 << 26)
> +#define FUSE_PASSTHROUGH	(1 << 27)
>  
>  /**
>   * CUSE INIT request/reply flags
> @@ -591,7 +593,7 @@ struct fuse_create_in {
>  struct fuse_open_out {
>  	uint64_t	fh;
>  	uint32_t	open_flags;
> -	uint32_t	padding;
> +	int32_t		fd;
>  };
>  
>  struct fuse_release_in {
> -- 
> 2.28.0.220.ged08abb693-goog


-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-19  6:01         ` Nikolaus Rath
@ 2020-08-19  9:24           ` Amir Goldstein
  2020-08-24 12:48             ` Miklos Szeredi
  0 siblings, 1 reply; 13+ messages in thread
From: Amir Goldstein @ 2020-08-19  9:24 UTC (permalink / raw)
  To: Alessio Balsini, Jann Horn, Jens Axboe, Miklos Szeredi,
	Nikhilesh Reddy, Akilesh Kailash, David Anderson, Eric Yan,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, kernel-team, linux-fsdevel, kernel list

On Wed, Aug 19, 2020 at 9:03 AM Nikolaus Rath <Nikolaus@rath.org> wrote:
>
> Hi Alessio,
>
> Thank you for working on this, I'm excited to see things moving again on
> this front!
>
> What I would really like to see in the long-term is the ability for FUSE
> to support passthrough for specific areas of a file, i.e. the ability to
> specify different passthrough fds for different regions of the FUSE
> file.
>
> I do not want to ask you to extend the patch, but I am a little worried
> that the proposed interface would make a future extension more
> complicated than necessary. Could you comment on that? Is there a way to
> design the libfuse interface in a way that would - in the future - make
> it easier to extend it along these lines?
>
> What I have in mind is things like not coupling the setup of the
> passthrough fds to open(), but having a separate notification message for
> this (like what we use for invalidation of cache), and adding not just
> an "fd" field but also "offset" and "length" fields (which would
> currently be required to be both zero to get the "full file" semantics).
>

You mean like this?

https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/commit/?h=fuse2

I cannot blame Alessio for missing this POC patch by Miklos.
It was well hidden in another thread:
https://lore.kernel.org/linux-fsdevel/CAJfpegtjEoE7H8tayLaQHG9fRSBiVuaspnmPr2oQiOZXVB1+7g@mail.gmail.com/

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-18 13:53       ` Alessio Balsini
  2020-08-19  6:01         ` Nikolaus Rath
@ 2020-08-19 10:04         ` Jann Horn
  2020-09-08 21:42           ` Alessio Balsini
  1 sibling, 1 reply; 13+ messages in thread
From: Jann Horn @ 2020-08-19 10:04 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Jens Axboe, Miklos Szeredi, Nikhilesh Reddy, Akilesh Kailash,
	David Anderson, Eric Yan, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, kernel-team,
	linux-fsdevel, kernel list

On Tue, Aug 18, 2020 at 3:53 PM Alessio Balsini <balsini@android.com> wrote:
> Thank you both for the important feedback,
>
> I tried to consolidate all your suggestions in the new version of the
> patch, shared below.
>
> As you both recommended, that tricky ki_filp swapping has been removed,
> taking overlayfs as a reference for the management of asynchronous
> requests.
> The V7 below went again through some testing with fio (sync and libaio
> engines with several jobs in randrw), fsstress, and a bunch of kernel
> builds to trigger both the sync and async paths. Gladly everything worked
> as expected.
>
> What I didn't get from overlayfs are the temporary cred switchings, as I
> think that in the FUSE use case it is still the FUSE daemon that should
> take care of identifying if the requesting process has the right
> permissions before allowing the passthrough feature.

FYI, rw_verify_area() annoyingly calls security_file_permission(),
which is hooked by some stuff like SELinux. This doesn't really fit
well into the normal Linux security model, and it means that when a
userspace process tries to read from a FUSE file with passthrough
enabled, SELinux will actually perform access checks against *both*
the FUSE file itself and the backing file. That's kinda stupid, but
SELinux does it. If this ends up being a problem, switching
credentials would help. (Changing rw_verify_area() to not call
security_file_permission() would also help, but some of the LSM folks
might disagree with my opinion that that check is silly and should be
removed.)

But for now, I guess it might be fine to leave things as-is and not do
the extra credential switching.

> On Thu, Aug 13, 2020 at 08:30:21PM +0200, 'Jann Horn' via kernel-team wrote:
> > On Thu, Aug 13, 2020 at 3:28 PM Alessio Balsini <balsini@android.com> wrote:
> > [...]
> >
> > My point is that you can use this security issue to steal file
> > descriptors from processes that are _not_ supposed to act as FUSE
> > daemons. For example, let's say there is some setuid root executable
> > that opens /etc/shadow and then writes a message to stdout. If I
> > invoke this setuid root executable with stdout pointing to a FUSE
> > device fd, and I can arrange for the message written by the setuid
> > process to look like a FUSE message, I can trick the setuid process to
> > install its /etc/shadow fd as a passthrough fd on a FUSE file, and
> > then, through the FUSE filesystem, mess with /etc/shadow.
> >
> > Also, for context: While on Android, access to /dev/fuse is
> > restricted, desktop Linux systems allow unprivileged users to use
> > FUSE.
> > [...]
> > The new fuse_dev_ioctl() command would behave just like
> > fuse_dev_write(), except that the ioctl-based interface would permit
> > OPEN/CREATE replies with FOPEN_PASSTHROUGH, while the write()-based
> > interface would reject them.
> > [...]
> > I can't think of any checks that you could do there to make it safe,
> > because fundamentally what you have to figure out is _userspace's
> > intent_, and you can't figure out what that is if userspace just calls
> > write().
> > [...]
> > In this case, an error indicates that the userspace programmer made a
> > mistake. So even if the userspace programmer is not looking at kernel
> > logs, we should indicate to them that they messed up - and we can do
> > that by returning an error code from the syscall. So I think we should
> > ideally abort the operation in this case.
>
> I've been thinking about the security issue you mentioned, but I'm thinking
> that besides the extra Android restrictions, it shouldn't be that easy to
> play with /dev/fuse, also for a privileged process in Linux.
> In order for a process to successfully send commands to /dev/fuse, the
> command has to match a pending request coming from the FUSE filesystem,
> meaning that the same privileged process should have mounted the FUSE
> filesystem, actually becoming a FUSE daemon itself.
> Is this still an eventually exploitable scenario?

There are setuid root binaries that will print attacker-controlled
data to stdout, or stuff like that. An attacker could set up a normal
FUSE filesystem, read an OPEN request from the /dev/fuse file
descriptor, construct a reply that would install a passthrough fd (but
NOT actually send it), and then invoke a setuid root binary with the
/dev/fuse fd as stderr and with some arguments that trick it into
writing the attacker-constructed reply to stderr. For example, you can
make "su" write a string starting with attacker-controlled data like
this:

$ cat blah.c
#include <unistd.h>
#include <err.h>
int main(void) {
  execlp("su", "ATTACKER-CONTROLLED STRING", "-BLAH", NULL);
  err(1, "execlp");
}
$ gcc -o blah blah.c
$ ./blah
ATTACKER-CONTROLLED STRING: invalid option -- 'B'
Try 'ATTACKER-CONTROLLED STRING --help' for more information.
$

And yes, sure, I know, you can't write nullbytes with this specific
example; but in principle, you have to assume that a privileged victim
process might be writing arbitrary attacker-controlled data to
attacker-supplied file descriptors.

> > No, call_write_iter() doesn't do any of the required checking and
> > setup and notification steps, it just calls into the file's
> > ->write_iter handler:
> >
> > static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
> >       struct iov_iter *iter)
> > {
> >         return file->f_op->write_iter(kio, iter);
> > }
> >
> > Did you confuse call_write_iter() with vfs_iocb_iter_write(), or
> > something like that?
>
> Ops, you are right, I was referring to vfs_iocb_iter_write(), that was
> actually part of my first design, but ended up simplifying in favor of
> call_write_iter().
>
> > Requests can complete asynchronously. That means call_write_iter() can
> > return more or less immediately, while the request is processed in the
> > background, and at some point, a callback is invoked. That's what that
> > -EIOCBQUEUED return value is about. In that case, the current code
> > will change ->ki_filp while the request is still being handled in the
> > background.
> >
> > I recommend looking at ovl_write_iter() in overlayfs to see how
> > they're dealing with this case.
>
> That was a great suggestion, I hopefully picked the right things from
> overlayfs, without overlooking some security aspects.
[...]
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
[...]
> +static void fuse_copyattr(struct file *dst_file, struct file *src_file,
> +                         bool write)
> +{
> +       struct inode *dst = file_inode(dst_file);
> +       struct inode *src = file_inode(src_file);
> +
> +       fsstack_copy_attr_times(dst, src);
> +       if (write)
> +               fsstack_copy_inode_size(dst, src);
> +}
[...]
> +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> +                                  struct iov_iter *iter)
> +{
> +       struct file *fuse_filp = iocb_fuse->ki_filp;
> +       struct fuse_file *ff = fuse_filp->private_data;
> +       struct file *passthrough_filp = ff->passthrough_filp;
> +       ssize_t ret;
> +
> +       if (!iov_iter_count(iter))
> +               return 0;
> +
> +       if (is_sync_kiocb(iocb_fuse)) {
> +               struct kiocb iocb;
> +
> +               kiocb_clone(&iocb, iocb_fuse, passthrough_filp);
> +               ret = call_read_iter(passthrough_filp, &iocb, iter);
> +               iocb_fuse->ki_pos = iocb.ki_pos;
> +       } else {
> +               struct fuse_aio_req *aio_req;
> +
> +               aio_req =
> +                       kmem_cache_zalloc(fuse_aio_request_cachep, GFP_KERNEL);
> +               if (!aio_req)
> +                       return -ENOMEM;
> +
> +               aio_req->iocb_fuse = iocb_fuse;
> +               kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
> +               aio_req->iocb.ki_complete = fuse_aio_rw_complete;
> +               ret = vfs_iocb_iter_read(passthrough_filp, &aio_req->iocb,
> +                                        iter);
> +               if (ret != -EIOCBQUEUED)
> +                       fuse_aio_cleanup_handler(aio_req);
> +       }
> +
> +       fuse_copyattr(fuse_filp, passthrough_filp, false);

This copies timestamps from the passthrough_filp to the fuse_filp
without any locking I can see - which means that depending on the
architecture and the compiler's decisions, the copied timestamps can
end up getting corrupted to some degree. On 64-bit platforms, in
practice, you might e.g. end up with a timestamp where the tv_nsec
part is from an older timestamp while tv_sec is from a newer one,
yielding an overall timestamp that might e.g. be almost a second in
the future; and on 32-bit platforms, the results could even be much
worse. In theory, this is a data race, which should generally be
avoided.

The same thing applies to some of the other places where
fuse_copyattr() is used.

From a higher-level perspective, are you sure that you even want to do
this timestamp-copying at all, given that the userspace fuse daemon
will also be supplying its own timestamps? Especially considering that
the timestamps are at the inode level while the passthrough logic is
at the file level, that seems pretty wonky.

> +       return ret;
> +}
[...]
> +int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_req *req)
> +{
> +       struct super_block *passthrough_sb;
> +       struct inode *passthrough_inode;
> +       struct fuse_open_out *open_out;
> +       struct file *passthrough_filp;
> +       unsigned short open_out_index;
> +       int fs_stack_depth;
> +
> +       req->passthrough_filp = NULL;
> +
> +       if (!fc->passthrough)
> +               return 0;
> +
> +       if (!(req->in.h.opcode == FUSE_OPEN && req->args->out_numargs == 1) &&
> +           !(req->in.h.opcode == FUSE_CREATE && req->args->out_numargs == 2))
> +               return 0;
> +
> +       open_out_index = req->args->out_numargs - 1;
> +
> +       if (req->args->out_args[open_out_index].size != sizeof(*open_out))
> +               return 0;

Can this ever legitimately happen, or would that indicate a kernel
bug? If it indicates a kernel bug, please write WARN_ON() around the
condition, like this: "if
(WARN_ON(req->args->out_args[open_out_index].size !=
sizeof(*open_out)))".

> +       open_out = req->args->out_args[open_out_index].value;
> +
> +       if (!(open_out->open_flags & FOPEN_PASSTHROUGH))
> +               return 0;
> +
> +       passthrough_filp = fget(open_out->fd);
> +       if (!passthrough_filp) {
> +               pr_err("FUSE: invalid file descriptor for passthrough.\n");
> +               return -1;
> +       }
> +
> +       if (!passthrough_filp->f_op->read_iter ||
> +           !passthrough_filp->f_op->write_iter) {
> +               pr_err("FUSE: passthrough file misses file operations.\n");
> +               fput(passthrough_filp);
> +               return -1;
> +       }

(Just as a non-actionable comment: This means that passthrough files
won't work with files that use ->read and ->write handlers instead of
->read_iter and ->write_iter. But that's probably actually a good
thing - ->read and ->write are mostly used by weird pseudo-files, and
I don't see any legitimate reason why anyone would want to use those
things with FUSE passthrough.)

> +       passthrough_inode = file_inode(passthrough_filp);
> +       passthrough_sb = passthrough_inode->i_sb;
> +       fs_stack_depth = passthrough_sb->s_stack_depth + 1;
> +       if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> +               pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> +               fput(passthrough_filp);
> +               return -1;
> +       }
> +
> +       req->passthrough_filp = passthrough_filp;
> +       return 0;
> +}
[...]
> +int __init fuse_passthrough_aio_request_cache_init(void)
> +{
> +       fuse_aio_request_cachep =
> +               kmem_cache_create("fuse_aio_req", sizeof(struct fuse_aio_req),
> +                                 0, SLAB_HWCACHE_ALIGN, NULL);
> +       if (!fuse_aio_request_cachep)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +void fuse_passthrough_aio_request_cache_destroy(void)
> +{
> +       kmem_cache_destroy(fuse_aio_request_cachep);
> +}

I think you don't have to make your own slab cache for this; it should
be fine to just use kmalloc() instead. Using a dedicated cache will be
more space-efficient if you have tons of allocations, but you won't
have a huge number of these allocations existing simultaneously, so
that shouldn't matter much. And if you only have a very small number
of allocations, you'll probably actually end up using more memory.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-19  9:24           ` Amir Goldstein
@ 2020-08-24 12:48             ` Miklos Szeredi
  2020-09-11 16:23               ` Alessio Balsini
  0 siblings, 1 reply; 13+ messages in thread
From: Miklos Szeredi @ 2020-08-24 12:48 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Alessio Balsini, Jann Horn, Jens Axboe, Nikhilesh Reddy,
	Akilesh Kailash, David Anderson, Eric Yan, Martijn Coenen,
	Palmer Dabbelt, Paul Lawrence, Stefano Duo, Zimuzo Ezeozue,
	kernel-team, linux-fsdevel, kernel list

On Wed, Aug 19, 2020 at 11:25 AM Amir Goldstein <amir73il@gmail.com> wrote:

> > What I have in mind is things like not coupling the setup of the
> > passthrough fds to open(), but having a separate notification message for
> > this (like what we use for invalidation of cache), and adding not just
> > an "fd" field but also "offset" and "length" fields (which would
> > currently be required to be both zero to get the "full file" semantics).
> >
>
> You mean like this?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/commit/?h=fuse2

Look specifically at fuse_file_map_iter():

https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/tree/fs/fuse2/file.c?h=fuse2#n582

and fudev_map_ioctl():

https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/tree/fs/fuse2/fudev.c?h=fuse2#n601

This avoids the security issue Jann mentioned as well as allowing
arbitrary mapping of file ranges.  E.g. it could also  be used by a
block based filesystem to map I/O directly into the block device.

What the implementation lacks is any kind of caching.  Since your
usecase involves just one map extent per file, special casing that
would be trivial.  We can revisit general caching later.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-19 10:04         ` Jann Horn
@ 2020-09-08 21:42           ` Alessio Balsini
  0 siblings, 0 replies; 13+ messages in thread
From: Alessio Balsini @ 2020-09-08 21:42 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alessio Balsini, Jens Axboe, Miklos Szeredi, Nikhilesh Reddy,
	Akilesh Kailash, David Anderson, Eric Yan, Martijn Coenen,
	Palmer Dabbelt, Paul Lawrence, Stefano Duo, Zimuzo Ezeozue,
	kernel-team, linux-fsdevel, kernel list

Thanks Jann,

Sorry for the late reply, I wanted to better understand the problems you
mentioned and explore the ioctl solution before coming back to discussion.
I have a new patch ready which addresses the feedbacks received, and that
has been converted to using a new ioctl. And I have to say I'm happy with
this ioctl-based solution. :)
I'll finish replying to the different branches of this thread and post the
patch as soon as I'm done with testing, hopefully by tomorrow EOD.

Inline replies below.

On Wed, Aug 19, 2020 at 12:04:03PM +0200, Jann Horn wrote:
> [...]
> FYI, rw_verify_area() annoyingly calls security_file_permission(),
> which is hooked by some stuff like SELinux. This doesn't really fit
> well into the normal Linux security model, and it means that when a
> userspace process tries to read from a FUSE file with passthrough
> enabled, SELinux will actually perform access checks against *both*
> the FUSE file itself and the backing file. That's kinda stupid, but
> SELinux does it. If this ends up being a problem, switching
> credentials would help. (Changing rw_verify_area() to not call
> security_file_permission() would also help, but some of the LSM folks
> might disagree with my opinion that that check is silly and should be
> removed.)
> 
> But for now, I guess it might be fine to leave things as-is and not do
> the extra credential switching.

I would be happy to leave this simplified access as-is too, or in the case
something bad comes out, to implement the credential switching solution.
Let's see if also Miklos or Jens have an opinion on this.

> 
> > On Thu, Aug 13, 2020 at 08:30:21PM +0200, 'Jann Horn' via kernel-team wrote:
> [...]
> There are setuid root binaries that will print attacker-controlled
> data to stdout, or stuff like that. An attacker could set up a normal
> FUSE filesystem, read an OPEN request from the /dev/fuse file
> descriptor, construct a reply that would install a passthrough fd (but
> NOT actually send it), and then invoke a setuid root binary with the
> /dev/fuse fd as stderr and with some arguments that trick it into
> writing the attacker-constructed reply to stderr. For example, you can
> make "su" write a string starting with attacker-controlled data like
> this:
> 
> $ cat blah.c
> #include <unistd.h>
> #include <err.h>
> int main(void) {
>   execlp("su", "ATTACKER-CONTROLLED STRING", "-BLAH", NULL);
>   err(1, "execlp");
> }
> $ gcc -o blah blah.c
> $ ./blah
> ATTACKER-CONTROLLED STRING: invalid option -- 'B'
> Try 'ATTACKER-CONTROLLED STRING --help' for more information.
> $
> 
> And yes, sure, I know, you can't write nullbytes with this specific
> example; but in principle, you have to assume that a privileged victim
> process might be writing arbitrary attacker-controlled data to
> attacker-supplied file descriptors.

Now I see your point! The ioctl-based solution is on the way.

> [...]
> > +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> > +                                  struct iov_iter *iter)
> > +{
> > +       struct file *fuse_filp = iocb_fuse->ki_filp;
> > +       struct fuse_file *ff = fuse_filp->private_data;
> > +       struct file *passthrough_filp = ff->passthrough_filp;
> > +       ssize_t ret;
> > +
> > +       if (!iov_iter_count(iter))
> > +               return 0;
> > +
> > +       if (is_sync_kiocb(iocb_fuse)) {
> > +               struct kiocb iocb;
> > +
> > +               kiocb_clone(&iocb, iocb_fuse, passthrough_filp);
> > +               ret = call_read_iter(passthrough_filp, &iocb, iter);
> > +               iocb_fuse->ki_pos = iocb.ki_pos;
> > +       } else {
> > +               struct fuse_aio_req *aio_req;
> > +
> > +               aio_req =
> > +                       kmem_cache_zalloc(fuse_aio_request_cachep, GFP_KERNEL);
> > +               if (!aio_req)
> > +                       return -ENOMEM;
> > +
> > +               aio_req->iocb_fuse = iocb_fuse;
> > +               kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
> > +               aio_req->iocb.ki_complete = fuse_aio_rw_complete;
> > +               ret = vfs_iocb_iter_read(passthrough_filp, &aio_req->iocb,
> > +                                        iter);
> > +               if (ret != -EIOCBQUEUED)
> > +                       fuse_aio_cleanup_handler(aio_req);
> > +       }
> > +
> > +       fuse_copyattr(fuse_filp, passthrough_filp, false);
> 
> This copies timestamps from the passthrough_filp to the fuse_filp
> without any locking I can see - which means that depending on the
> architecture and the compiler's decisions, the copied timestamps can
> end up getting corrupted to some degree. On 64-bit platforms, in
> practice, you might e.g. end up with a timestamp where the tv_nsec
> part is from an older timestamp while tv_sec is from a newer one,
> yielding an overall timestamp that might e.g. be almost a second in
> the future; and on 32-bit platforms, the results could even be much
> worse. In theory, this is a data race, which should generally be
> avoided.
> 
> The same thing applies to some of the other places where
> fuse_copyattr() is used.

I'm not too confident in this RCU domain, so what do you think about
changing the fuse_copyattr() code into:

static void fuse_copyattr(struct file *dst_file, struct file *src_file,
                          bool write)
{
        struct inode *dst = file_inode(dst_file);
        struct inode *src = file_inode(src_file);
        struct timespec64 i_atime, i_mtime, i_ctime;

        spin_lock(&src->i_lock);
        i_atime = src->i_atime;
        i_mtime = src->i_mtime;
        i_ctime = src->i_ctime;
        spin_unlock(&src->i_lock);
        spin_lock(&dst->i_lock);
        dst->i_atime = i_atime;
        dst->i_mtime = i_mtime;
        dst->i_ctime = i_ctime;
        spin_unlock(&dst->i_lock);

        if (write)
                fsstack_copy_inode_size(dst, src);
}

The i_*time copy behavior is similar to fsstack_copy_inode_size().
IIRC I didn't add any locking because a comment at the top of fs_stack.h
mentions that the functions declared there do not require i_mutex to be
held, and since fsstack_copy_inode_size() has its own lockings, I probably
assumed that i_*time were special. But actually, not requiring i_mutex
doesn't mean not requiring any locking. Oops! :P

> 
> From a higher-level perspective, are you sure that you even want to do
> this timestamp-copying at all, given that the userspace fuse daemon
> will also be supplying its own timestamps? Especially considering that
> the timestamps are at the inode level while the passthrough logic is
> at the file level, that seems pretty wonky.

Yeah, that looks wonky... :)
But while testing I noticed that the system was misbehaving when the FUSE
stats were not manually updated with the lower filesystem.
My understanding is that being the FUSE passthrough read/write operations
transparent to the FUSE daemon, if stats caching is enabled the kernel
would return the cached stats without forwarding the request to the FUSE
daemon, and this may produce inconsistencies with the lower filesystem.
This copying solution prevents forcing the daemon to disable stats caching.

This reminds me that I should update the commit message mentioning the
reasoning behind these stats updates.

> [...]
> > +       if (!(req->in.h.opcode == FUSE_OPEN && req->args->out_numargs == 1) &&
> > +           !(req->in.h.opcode == FUSE_CREATE && req->args->out_numargs == 2))
> > +               return 0;
> > +
> > +       open_out_index = req->args->out_numargs - 1;
> > +
> > +       if (req->args->out_args[open_out_index].size != sizeof(*open_out))
> > +               return 0;
> 
> Can this ever legitimately happen, or would that indicate a kernel
> bug? If it indicates a kernel bug, please write WARN_ON() around the
> condition, like this: "if
> (WARN_ON(req->args->out_args[open_out_index].size !=
> sizeof(*open_out)))".

Will be chopped in the next patch with ioctl().

> > +       if (!passthrough_filp->f_op->read_iter ||
> > +           !passthrough_filp->f_op->write_iter) {
> > +               pr_err("FUSE: passthrough file misses file operations.\n");
> > +               fput(passthrough_filp);
> > +               return -1;
> > +       }
> 
> (Just as a non-actionable comment: This means that passthrough files
> won't work with files that use ->read and ->write handlers instead of
> ->read_iter and ->write_iter. But that's probably actually a good
> thing - ->read and ->write are mostly used by weird pseudo-files, and
> I don't see any legitimate reason why anyone would want to use those
> things with FUSE passthrough.)
> 

I like this checking because if on the one hand, as you mentioned, this
represents a restriction for those pseud-files not implementing
{read,write}_iter operations, on the other hand vfs automatically falls back to
the {read,write}_iter counterpart if a filesystem doesn't implement the
original read/write. So, why not?

> > +       passthrough_inode = file_inode(passthrough_filp);
> > +       passthrough_sb = passthrough_inode->i_sb;
> > +       fs_stack_depth = passthrough_sb->s_stack_depth + 1;
> > +       if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > +               pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> > +               fput(passthrough_filp);
> > +               return -1;
> > +       }
> > +
> > +       req->passthrough_filp = passthrough_filp;
> > +       return 0;
> > +}
> [...]
> > +int __init fuse_passthrough_aio_request_cache_init(void)
> > +{
> > +       fuse_aio_request_cachep =
> > +               kmem_cache_create("fuse_aio_req", sizeof(struct fuse_aio_req),
> > +                                 0, SLAB_HWCACHE_ALIGN, NULL);
> > +       if (!fuse_aio_request_cachep)
> > +               return -ENOMEM;
> > +
> > +       return 0;
> > +}
> > +
> > +void fuse_passthrough_aio_request_cache_destroy(void)
> > +{
> > +       kmem_cache_destroy(fuse_aio_request_cachep);
> > +}
> 
> I think you don't have to make your own slab cache for this; it should
> be fine to just use kmalloc() instead. Using a dedicated cache will be
> more space-efficient if you have tons of allocations, but you won't
> have a huge number of these allocations existing simultaneously, so
> that shouldn't matter much. And if you only have a very small number
> of allocations, you'll probably actually end up using more memory.

I fell into an overkill trying to do the right thing.
Jumping back to kmalloc/kfree.

Thanks again,
Alessio

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6] fuse: Add support for passthrough read/write
  2020-08-24 12:48             ` Miklos Szeredi
@ 2020-09-11 16:23               ` Alessio Balsini
  0 siblings, 0 replies; 13+ messages in thread
From: Alessio Balsini @ 2020-09-11 16:23 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, Alessio Balsini, Jann Horn, Jens Axboe,
	Nikhilesh Reddy, Akilesh Kailash, David Anderson, Eric Yan,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, kernel-team, linux-fsdevel, kernel list

Thanks all for the comments.

I have a patchset ready that hopefully wraps together the extendability
suggested by Nikolaus, that I agree is a good idea.
The way I tried to make it more flexible is first of all transitioning to a
ioctl(), as suggested by both Jann and Miklos, and by using a data
structure with flexible array member.

Thanks Amir for the fuse2 pointer. I didn't notice that project before, but
I really enjoyed going through its code.
I'm curious if it's intended to deprecate the current fuse implementation
or is what the current fuse will converge to. I noticed that some good
ideas that were in fuse2 have been also added to fuse, so I tried to take
fuse2 as a reference for my passthrough ioctl().

Miklos, can you please give us a glimpse of what's the future of fuse2?

Thanks a lot again for the feedback, I'll send the new patch in a few
minutes.

Cheers,
Alessio


On Mon, Aug 24, 2020 at 02:48:01PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 19, 2020 at 11:25 AM Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > > What I have in mind is things like not coupling the setup of the
> > > passthrough fds to open(), but having a separate notification message for
> > > this (like what we use for invalidation of cache), and adding not just
> > > an "fd" field but also "offset" and "length" fields (which would
> > > currently be required to be both zero to get the "full file" semantics).
> > >
> >
> > You mean like this?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/commit/?h=fuse2
> 
> Look specifically at fuse_file_map_iter():
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/tree/fs/fuse2/file.c?h=fuse2#n582
> 
> and fudev_map_ioctl():
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/tree/fs/fuse2/fudev.c?h=fuse2#n601
> 
> This avoids the security issue Jann mentioned as well as allowing
> arbitrary mapping of file ranges.  E.g. it could also  be used by a
> block based filesystem to map I/O directly into the block device.
> 
> What the implementation lacks is any kind of caching.  Since your
> usecase involves just one map extent per file, special casing that
> would be trivial.  We can revisit general caching later.
> 
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-09-11 16:24 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-12 16:14 [PATCH v6] fuse: Add support for passthrough read/write Alessio Balsini
2020-08-12 18:29 ` Jann Horn
2020-08-13 13:28   ` Alessio Balsini
2020-08-13 18:30     ` Jann Horn
2020-08-18 13:53       ` Alessio Balsini
2020-08-19  6:01         ` Nikolaus Rath
2020-08-19  9:24           ` Amir Goldstein
2020-08-24 12:48             ` Miklos Szeredi
2020-09-11 16:23               ` Alessio Balsini
2020-08-19 10:04         ` Jann Horn
2020-09-08 21:42           ` Alessio Balsini
2020-08-13 18:41   ` Jens Axboe
2020-08-12 22:06 ` kernel test robot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.