All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write
@ 2021-01-25 15:30 Alessio Balsini
  2021-01-25 15:30 ` [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags Alessio Balsini
                   ` (8 more replies)
  0 siblings, 9 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

This is the 12th version of the series, rebased on top of v5.11-rc5.
Please find the changelog at the bottom of this cover letter.

Add support for file system passthrough read/write of files when enabled
in userspace through the option FUSE_PASSTHROUGH.

There are file systems based on FUSE that are intended to enforce
special policies or trigger complicated decision makings at the file
operations level. Android, for example, uses FUSE to enforce
fine-grained access policies that also depend on the file contents.
Sometimes it happens that at open or create time a file is identified as
not requiring additional checks for consequent reads/writes, thus FUSE
would simply act as a passive bridge between the process accessing the
FUSE file system and the lower file system. Splicing and caching help
reduce the FUSE overhead, but there are still read/write operations
forwarded to the userspace FUSE daemon that could be avoided.

This series has been inspired by the original patches from Nikhilesh
Reddy, the idea and code of which has been elaborated and improved
thanks to the community support.

When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may
decide while handling the open/create operations, if the given file can
be accessed in passthrough mode. This means that all the further read
and write operations would be forwarded by the kernel directly to the
lower file system using the VFS layer rather than to the FUSE daemon.
All the requests other than reads or writes are still handled by the
userspace FUSE daemon.
This allows for improved performance on reads and writes, especially in
the case of reads at random offsets, for which no (readahead) caching
mechanism would help.
Benchmarks show improved performance that is close to native file system
access when doing massive manipulations on a single opened file,
especially in the case of random reads, random writes and sequential
writes. Detailed benchmarking results are presented below.

The creation of this direct connection (passthrough) between FUSE file
objects and file objects in the lower file system happens in a way that
reminds of passing file descriptors via sockets:
- a process requests the opening of a file handled by FUSE, so the
  kernel forwards the request to the FUSE daemon;
- the FUSE daemon opens the target file in the lower file system,
  getting its file descriptor;
- the FUSE daemon also decides according to its internal policies if
  passthrough can be enabled for that file, and, if so, can perform a
  FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl on /dev/fuse, passing the file
  descriptor obtained at the previous step and the fuse_req unique
  identifier;
- the kernel translates the file descriptor to the file pointer
  navigating through the opened files of the "current" process and
  temporarily stores it in the associated open/create fuse_req's
  passthrough_filp;
- when the FUSE daemon has done with the request and it's time for the
  kernel to close it, it checks if the passthrough_filp is available and
in case updates the additional field in the fuse_file owned by the
process accessing the FUSE file system.
From now on, all the read/write operations performed by that process
will be redirected to the corresponding lower file system file by
creating new VFS requests.
Since the read/write operation to the lower file system is executed with
the current process's credentials, it might happen that it does not have
enough privileges to succeed. For this reason, the process temporarily
receives the same credentials as the FUSE daemon, that are reverted as
soon as the read/write operation completes, emulating the behavior of
the request to be performed by the FUSE daemon itself. This solution has
been inspired by the way overlayfs handles read/write operations.
Asynchronous IO is supported as well, handled by creating separate AIO
requests for the lower file system that will be internally tracked by
FUSE, that intercepts and propagates their completion through an
internal ki_completed callback similar to the current implementation of
overlayfs.
Finally, also memory-mapped FUSE files are supported in this FUSE
passthrough series as it has been noticed that when a same file with
FUSE passthrough enabled is accessed both with standard
read/write(-iter) operations and memory-mapped read/write operations,
the file content might result corrupted due to an inconsistency between
the FUSE and lower file system caches.

The ioctl has been designed taking as a reference and trying to converge
to the fuse2 implementation. For example, the fuse_passthrough_out data
structure has extra fields that will allow for further extensions of the
feature.


    Performance on RAM block device

What follows has been performed using a custom passthrough_hp FUSE
daemon that enables pass-through for each file that is opened during
both "open" and "create". Benchmarks were run on an Intel Xeon W-2135,
64 GiB of RAM workstation, with a RAM block device used as storage
target. More specifically, out of the system's 64 GiB of RAM, 40 GiB
were reserved for /dev/ram0, formatted as ext4. For the FUSE and FUSE
passthrough benchmarks, the FUSE file system was mounted on top of the
mounted /dev/ram0 device.
That file system has been completely filled and then cleaned up before
running the benchmarks: this to ensure that all the /dev/ram0 space was
reserved and not usable as page cache.

The rationale for using a RAM block device is that SSDs may experience
performance fluctuations, especially when dealing with accessing data
random offsets.
Getting rid of the discrete storage device also removes a huge component
of slowness, highlighting the performance difference of the software
parts (and probably the goodness of CPU caching and its coherence
mechanisms).

No special tuning has been performed, e.g., all the involved processes
are SCHED_OTHER, ondemand is the frequency governor with no frequency
restrictions, and turbo-boost, as well as p-state, are active. This is
because I noticed that, for such high-level benchmarks, results
consistency was minimally affected by these features.

The source code of the updated libfuse library and passthrough_hp is
shared at the following repository:

  https://github.com/balsini/libfuse/tree/fuse-passthrough-v12-v5.11-rc5

Two different kinds of benchmarks were done for this change, the first
set of tests evaluates the bandwidth improvements when manipulating huge
single files, the second set of tests verify that no performance
regressions were introduced when handling many small files.

All the caches were dropped before running every benchmark with:

  echo 3 > /proc/sys/vm/drop_caches

All the benchmarks were run 10 times, with 1 minute cool down between
each run.

The first benchmarks were done by running FIO (fio-3.24) with:
- bs=4Ki;
- file size: 35Gi;
- ioengine: sync;
- fsync_on_close=1;
- randseed=0.
The target file has been chosen large enough to avoid it to be entirely
loaded into the page cache.

Results are presented in the following table:

+-----------+------------+-------------+-------------+
|   MiB/s   |    fuse    | passthrough |   native    |
+-----------+------------+-------------+-------------+
| read      | 471(±1.3%) | 1791(±1.0%) | 1839(±1.8%) |
| write     | 95(±.6%)   | 1068(±.9%)  | 1322(±.8%)  |
| randread  | 25(±1.7%)  | 860(±.8%)   | 1135(±.5%)  |
| randwrite | 76(±3.0%)  | 813(±1.0%)  | 1005(±.7%)  |
+-----------+------------+-------------+-------------+

This table shows that FUSE, except for the sequential reads, is far
behind FUSE passthrough and native in terms of performance. The
extremely good FUSE performance for sequential reads is the result of a
great read-ahead mechanism. I was able to verify that setting
read_ahead_kb to 0 causes a terrible performance drop.
All the results are stable, as shown by the standard deviations.
Moreover, these numbers show the reasonable gap between passthrough and
native, introduced by the extra traversal through the VFS layer.

As long as this patch has the primary objective of improving bandwidth,
another set of tests has been performed to see how this behaves on a
totally different scenario that involves accessing many small files. For
this purpose, measuring the build time of the Linux kernel has been
chosen as an appropriate, well-known, workload. The kernel has been
built with as many processes as the number of logical CPUs (-j
$(nproc)), that besides being a reasonable parallelization value, is
also enough to saturate the processor's utilization thanks to the
additional FUSE daemon's threads, making it even harder to get closer to
the native file system performance.
The following table shows the total build times in the different
configurations:

+------------------+--------------+-----------+
|                  | AVG duration |  Standard |
|                  |     (sec)    | deviation |
+------------------+--------------+-----------+
| FUSE             |      144.566 |     0.697 |
+------------------+--------------+-----------+
| FUSE passthrough |      133.820 |     0.341 |
+------------------+--------------+-----------+
| Native           |      109.423 |     0.724 |
+------------------+--------------+-----------+

Further testing and performance evaluations are welcome.


    Description of the series

Patch 1 generalizes the function which converts iocb flags to rw flags
from overlayfs, so that can be used in this patch set.

Patch 2 enables the 32-bit compatibility for the /dev/fuse ioctl.

Patch 3 introduces the data structures, function signatures and ioctl
required both for the communication with userspace and for the internal
kernel use.

Patch 4 introduces initialization and release functions for FUSE
passthrough.

Patch 5 enables the synchronous read and write operations for those FUSE
files for which the passthrough functionality is enabled.

Patch 6 extends the read and write operations to also support
asynchronous IO.

Patch 7 allows FUSE passthrough to target files for which the requesting
process would not have direct access to, by temporarily performing a
credentials switch to the credentials of the FUSE daemon that issued the
FUSE passthrough ioctl.

Patch 8 extends FUSE passthrough operations to memory-mapped FUSE files.


    Changelog

Changes in v12:
* Revert FILESYSTEM_MAX_STACK_DEPTH checks as they were in v10
  [Requested by Amir Goldstein]
* Introduce passthrough support for memory-mapped FUSE files
  [Requested by yanwu]

Changes in v11:
* Fix the FILESYSTEM_MAX_STACK_DEPTH check to allow other file systems
  to be stacked
* Moved file system stacking depth check at ioctl time
* Update cover letter with correct libfuse repository to test the change
  [Requested by Peng Tao]
* Fix the file reference counter leak introduced in v10
  [Requested by yanwu]

Changes in v10:
* UAPI updated: ioctl now returns an ID that will be used at open/create
  response time to reference the passthrough file
* Synchronous read/write_iter functions does not return silly errors
  (fixed in aio patch)
* FUSE daemon credentials updated at ioctl time instead of mount time
* Updated benchmark results
  [Requested by Miklos Szeredi]

Changes in v9:
* Switched to using VFS instead of direct lower FS file ops
  [Attempt to address a request from Jens Axboe, Jann Horn,
  Amir Goldstein]
* Removal of useless included aio.h header
  [Proposed by Jens Axboe]

Changes in v8:
* aio requests now use kmalloc/kfree, instead of kmem_cache
* Switched to call_{read,write}_iter in AIO
* Revisited attributes copy
* Passthrough can only be enabled via ioctl, fixing the security issue
  spotted by Jann
* Use an extensible fuse_passthrough_out data structure
  [Attempt to address a request from Nikolaus Rath, Amir Goldstein and
Miklos Szeredi]

Changes in v7:
* Full handling of aio requests as done in overlayfs (update commit
* message).
* s/fget_raw/fget.
* Open fails in case of passthrough errors, emitting warning messages.
  [Proposed by Jann Horn]
* Create new local kiocb, getting rid of the previously proposed ki_filp
  swapping.
  [Proposed by Jann Horn and Jens Axboe]
* Code polishing.

Changes in v6:
* Port to kernel v5.8:
  * fuse_file_{read,write}_iter changed since the v5 of this patch was
    proposed.
* Simplify fuse_simple_request.
* Merge fuse_passthrough.h into fuse_i.h
* Refactor of passthrough.c:
  * Remove BUG_ONs.
  * Simplified error checking and request arguments indexing.
  * Use call_{read,write}_iter utility functions.
  * Remove get_file and fputs during read/write: handle the extra FUSE
    references to the lower file object when the fuse_file is
    created/deleted.
  [Proposed by Jann Horn]

Changes in v5:
* Fix the check when setting the passthrough file.
  [Found when testing by Mike Shal]

Changes in v3 and v4:
* Use the fs_stack_depth to prevent further stacking and a minor fix.
  [Proposed by Jann Horn]

Changes in v2:
* Changed the feature name to passthrough from stacked_io.
  [Proposed by Linus Torvalds]


Alessio Balsini (8):
  fs: Generic function to convert iocb to rw flags
  fuse: 32-bit user space ioctl compat for fuse device
  fuse: Definitions and ioctl for passthrough
  fuse: Passthrough initialization and release
  fuse: Introduce synchronous read and write for passthrough
  fuse: Handle asynchronous read and write in passthrough
  fuse: Use daemon creds in passthrough mode
  fuse: Introduce passthrough for mmap

 fs/fuse/Makefile          |   1 +
 fs/fuse/dev.c             |  41 ++++--
 fs/fuse/dir.c             |   2 +
 fs/fuse/file.c            |  15 +-
 fs/fuse/fuse_i.h          |  33 +++++
 fs/fuse/inode.c           |  22 ++-
 fs/fuse/passthrough.c     | 280 ++++++++++++++++++++++++++++++++++++++
 fs/overlayfs/file.c       |  23 +---
 include/linux/fs.h        |   5 +
 include/uapi/linux/fuse.h |  14 +-
 10 files changed, 401 insertions(+), 35 deletions(-)
 create mode 100644 fs/fuse/passthrough.c

-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
@ 2021-01-25 15:30 ` Alessio Balsini
  2021-01-25 16:46   ` Alessio Balsini
  2021-01-25 15:30 ` [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device Alessio Balsini
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

OverlayFS implements its own function to translate iocb flags into rw
flags, so that they can be passed into another vfs call.
With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
Jens created a 1:1 matching between the iocb flags and rw flags,
simplifying the conversion.

Reduce the OverlayFS code by making the flag conversion function generic
and reusable.

Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/overlayfs/file.c | 23 +++++------------------
 include/linux/fs.h  |  5 +++++
 2 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index bd9dd38347ae..56be2ffc5a14 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -15,6 +15,8 @@
 #include <linux/fs.h>
 #include "overlayfs.h"
 
+#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
+
 struct ovl_aio_req {
 	struct kiocb iocb;
 	struct kiocb *orig_iocb;
@@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
 	touch_atime(&file->f_path);
 }
 
-static rwf_t ovl_iocb_to_rwf(int ifl)
-{
-	rwf_t flags = 0;
-
-	if (ifl & IOCB_NOWAIT)
-		flags |= RWF_NOWAIT;
-	if (ifl & IOCB_HIPRI)
-		flags |= RWF_HIPRI;
-	if (ifl & IOCB_DSYNC)
-		flags |= RWF_DSYNC;
-	if (ifl & IOCB_SYNC)
-		flags |= RWF_SYNC;
-
-	return flags;
-}
-
 static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
 {
 	struct kiocb *iocb = &aio_req->iocb;
@@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	if (is_sync_kiocb(iocb)) {
 		ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
-				    ovl_iocb_to_rwf(iocb->ki_flags));
+				    iocb_to_rw_flags(iocb->ki_flags,
+						     OVL_IOCB_MASK));
 	} else {
 		struct ovl_aio_req *aio_req;
 
@@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	if (is_sync_kiocb(iocb)) {
 		file_start_write(real.file);
 		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
-				     ovl_iocb_to_rwf(ifl));
+				     iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
 		file_end_write(real.file);
 		/* Update size */
 		ovl_copyattr(ovl_inode_real(inode), inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..647c35423545 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 	return 0;
 }
 
+static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
+{
+	return ifl & iocb_mask;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
 	ino_t res;
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
  2021-01-25 15:30 ` [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags Alessio Balsini
@ 2021-01-25 15:30 ` Alessio Balsini
       [not found]   ` <CAMAHBGzkfEd9-1u0iKXp65ReJQgUi_=4sMpmfkwEOaMp6Ux7pg@mail.gmail.com>
                     ` (2 more replies)
  2021-01-25 15:30 ` [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough Alessio Balsini
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

With a 64-bit kernel build the FUSE device cannot handle ioctl requests
coming from 32-bit user space.
This is due to the ioctl command translation that generates different
command identifiers that thus cannot be used for direct comparisons
without proper manipulation.

Explicitly extract type and number from the ioctl command to enable
32-bit user space compatibility on 64-bit kernel builds.

Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/fuse/dev.c             | 29 ++++++++++++++++++-----------
 include/uapi/linux/fuse.h |  3 ++-
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 588f8d1240aa..ff9f3b83f879 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2233,37 +2233,44 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
 static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 			   unsigned long arg)
 {
-	int err = -ENOTTY;
+	int res;
+	int oldfd;
+	struct fuse_dev *fud = NULL;
 
-	if (cmd == FUSE_DEV_IOC_CLONE) {
-		int oldfd;
+	if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
+		return -EINVAL;
 
-		err = -EFAULT;
-		if (!get_user(oldfd, (__u32 __user *) arg)) {
+	switch (_IOC_NR(cmd)) {
+	case _IOC_NR(FUSE_DEV_IOC_CLONE):
+		res = -EFAULT;
+		if (!get_user(oldfd, (__u32 __user *)arg)) {
 			struct file *old = fget(oldfd);
 
-			err = -EINVAL;
+			res = -EINVAL;
 			if (old) {
-				struct fuse_dev *fud = NULL;
-
 				/*
 				 * Check against file->f_op because CUSE
 				 * uses the same ioctl handler.
 				 */
 				if (old->f_op == file->f_op &&
-				    old->f_cred->user_ns == file->f_cred->user_ns)
+				    old->f_cred->user_ns ==
+					    file->f_cred->user_ns)
 					fud = fuse_get_dev(old);
 
 				if (fud) {
 					mutex_lock(&fuse_mutex);
-					err = fuse_device_clone(fud->fc, file);
+					res = fuse_device_clone(fud->fc, file);
 					mutex_unlock(&fuse_mutex);
 				}
 				fput(old);
 			}
 		}
+		break;
+	default:
+		res = -ENOTTY;
+		break;
 	}
-	return err;
+	return res;
 }
 
 const struct file_operations fuse_dev_operations = {
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 98ca64d1beb6..54442612c48b 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -903,7 +903,8 @@ struct fuse_notify_retrieve_in {
 };
 
 /* Device ioctls: */
-#define FUSE_DEV_IOC_CLONE	_IOR(229, 0, uint32_t)
+#define FUSE_DEV_IOC_MAGIC		229
+#define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
  2021-01-25 15:30 ` [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags Alessio Balsini
  2021-01-25 15:30 ` [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device Alessio Balsini
@ 2021-01-25 15:30 ` Alessio Balsini
  2021-02-17 13:41   ` Miklos Szeredi
  2021-01-25 15:30 ` [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release Alessio Balsini
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

Expose the FUSE_PASSTHROUGH interface to user space and declare all the
basic data structures and functions as the skeleton on top of which the
FUSE passthrough functionality will be built.

As part of this, introduce the new FUSE passthrough ioctl, which allows
the FUSE daemon to specify a direct connection between a FUSE file and a
lower file system file. Such ioctl requires user space to pass the file
descriptor of one of its opened files through the fuse_passthrough_out
data structure introduced in this patch. This structure includes extra
fields for possible future extensions.
Also, add the passthrough functions for the set-up and tear-down of the
data structures and locks that will be used both when fuse_conns and
fuse_files are created/deleted.

Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/fuse/Makefile          |  1 +
 fs/fuse/dev.c             | 12 ++++++++++++
 fs/fuse/dir.c             |  2 ++
 fs/fuse/file.c            |  4 +++-
 fs/fuse/fuse_i.h          | 27 +++++++++++++++++++++++++++
 fs/fuse/inode.c           | 17 ++++++++++++++++-
 fs/fuse/passthrough.c     | 21 +++++++++++++++++++++
 include/uapi/linux/fuse.h | 11 ++++++++++-
 8 files changed, 92 insertions(+), 3 deletions(-)
 create mode 100644 fs/fuse/passthrough.c

diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 8c7021fb2cd4..20ed23aa16fa 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_CUSE) += cuse.o
 obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
 
 fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
+fuse-y += passthrough.o
 fuse-$(CONFIG_FUSE_DAX) += dax.o
 
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ff9f3b83f879..5446f13db5a0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2236,6 +2236,7 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 	int res;
 	int oldfd;
 	struct fuse_dev *fud = NULL;
+	struct fuse_passthrough_out pto;
 
 	if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
 		return -EINVAL;
@@ -2266,6 +2267,17 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 			}
 		}
 		break;
+	case _IOC_NR(FUSE_DEV_IOC_PASSTHROUGH_OPEN):
+		res = -EFAULT;
+		if (!copy_from_user(&pto,
+				    (struct fuse_passthrough_out __user *)arg,
+				    sizeof(pto))) {
+			res = -EINVAL;
+			fud = fuse_get_dev(file);
+			if (fud)
+				res = fuse_passthrough_open(fud, &pto);
+		}
+		break;
 	default:
 		res = -ENOTTY;
 		break;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 78f9f209078c..c9a1b33c5481 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -513,6 +513,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
 {
 	int err;
 	struct inode *inode;
+	struct fuse_conn *fc = get_fuse_conn(dir);
 	struct fuse_mount *fm = get_fuse_mount(dir);
 	FUSE_ARGS(args);
 	struct fuse_forget_link *forget;
@@ -574,6 +575,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
 	ff->fh = outopen.fh;
 	ff->nodeid = outentry.nodeid;
 	ff->open_flags = outopen.open_flags;
+	fuse_passthrough_setup(fc, ff, &outopen);
 	inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
 			  &outentry.attr, entry_attr_timeout(&outentry), 0);
 	if (!inode) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb55fb8..953f3034c375 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -158,7 +158,7 @@ int fuse_do_open(struct fuse_mount *fm, u64 nodeid, struct file *file,
 		if (!err) {
 			ff->fh = outarg.fh;
 			ff->open_flags = outarg.open_flags;
-
+			fuse_passthrough_setup(fc, ff, &outarg);
 		} else if (err != -ENOSYS) {
 			fuse_file_free(ff);
 			return err;
@@ -304,6 +304,8 @@ void fuse_release_common(struct file *file, bool isdir)
 	struct fuse_release_args *ra = ff->release_args;
 	int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
 
+	fuse_passthrough_release(&ff->passthrough);
+
 	fuse_prepare_release(fi, ff, file->f_flags, opcode);
 
 	if (ff->flock) {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7c4b8cb93f9f..8d39f5304a11 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -180,6 +180,14 @@ struct fuse_conn;
 struct fuse_mount;
 struct fuse_release_args;
 
+/**
+ * Reference to lower filesystem file for read/write operations handled in
+ * passthrough mode
+ */
+struct fuse_passthrough {
+	struct file *filp;
+};
+
 /** FUSE specific file data */
 struct fuse_file {
 	/** Fuse connection for this file */
@@ -225,6 +233,9 @@ struct fuse_file {
 
 	} readdir;
 
+	/** Container for data related to the passthrough functionality */
+	struct fuse_passthrough passthrough;
+
 	/** RB node to be linked on fuse_conn->polled_files */
 	struct rb_node polled_node;
 
@@ -755,6 +766,9 @@ struct fuse_conn {
 	/* Auto-mount submounts announced by the server */
 	unsigned int auto_submounts:1;
 
+	/** Passthrough mode for read/write IO */
+	unsigned int passthrough:1;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
@@ -798,6 +812,12 @@ struct fuse_conn {
 
 	/** List of filesystems using this connection */
 	struct list_head mounts;
+
+	/** IDR for passthrough requests */
+	struct idr passthrough_req;
+
+	/** Protects passthrough_req */
+	spinlock_t passthrough_req_lock;
 };
 
 /*
@@ -1213,4 +1233,11 @@ void fuse_dax_inode_cleanup(struct inode *inode);
 bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment);
 void fuse_dax_cancel_work(struct fuse_conn *fc);
 
+/* passthrough.c */
+int fuse_passthrough_open(struct fuse_dev *fud,
+			  struct fuse_passthrough_out *pto);
+int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
+			   struct fuse_open_out *openarg);
+void fuse_passthrough_release(struct fuse_passthrough *passthrough);
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b0e18b470e91..a1104d5abb70 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -691,6 +691,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
 	spin_lock_init(&fc->bg_lock);
+	spin_lock_init(&fc->passthrough_req_lock);
 	init_rwsem(&fc->killsb);
 	refcount_set(&fc->count, 1);
 	atomic_set(&fc->dev_count, 1);
@@ -699,6 +700,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
 	INIT_LIST_HEAD(&fc->bg_queue);
 	INIT_LIST_HEAD(&fc->entry);
 	INIT_LIST_HEAD(&fc->devices);
+	idr_init(&fc->passthrough_req);
 	atomic_set(&fc->num_waiting, 0);
 	fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
 	fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD;
@@ -1052,6 +1054,12 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 				fc->handle_killpriv_v2 = 1;
 				fm->sb->s_flags |= SB_NOSEC;
 			}
+			if (arg->flags & FUSE_PASSTHROUGH) {
+				fc->passthrough = 1;
+				/* Prevent further stacking */
+				fm->sb->s_stack_depth =
+					FILESYSTEM_MAX_STACK_DEPTH;
+			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1095,7 +1103,7 @@ void fuse_send_init(struct fuse_mount *fm)
 		FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
 		FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
 		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
-		FUSE_HANDLE_KILLPRIV_V2;
+		FUSE_HANDLE_KILLPRIV_V2 | FUSE_PASSTHROUGH;
 #ifdef CONFIG_FUSE_DAX
 	if (fm->fc->dax)
 		ia->in.flags |= FUSE_MAP_ALIGNMENT;
@@ -1123,9 +1131,16 @@ void fuse_send_init(struct fuse_mount *fm)
 }
 EXPORT_SYMBOL_GPL(fuse_send_init);
 
+static int free_fuse_passthrough(int id, void *p, void *data)
+{
+	return 0;
+}
+
 void fuse_free_conn(struct fuse_conn *fc)
 {
 	WARN_ON(!list_empty(&fc->devices));
+	idr_for_each(&fc->passthrough_req, free_fuse_passthrough, NULL);
+	idr_destroy(&fc->passthrough_req);
 	kfree_rcu(fc, rcu);
 }
 EXPORT_SYMBOL_GPL(fuse_free_conn);
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
new file mode 100644
index 000000000000..594060c654f8
--- /dev/null
+++ b/fs/fuse/passthrough.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "fuse_i.h"
+
+#include <linux/fuse.h>
+
+int fuse_passthrough_open(struct fuse_dev *fud,
+			  struct fuse_passthrough_out *pto)
+{
+	return -EINVAL;
+}
+
+int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
+			   struct fuse_open_out *openarg)
+{
+	return -EINVAL;
+}
+
+void fuse_passthrough_release(struct fuse_passthrough *passthrough)
+{
+}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 54442612c48b..9d7685ce0acd 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -360,6 +360,7 @@ struct fuse_file_lock {
 #define FUSE_MAP_ALIGNMENT	(1 << 26)
 #define FUSE_SUBMOUNTS		(1 << 27)
 #define FUSE_HANDLE_KILLPRIV_V2	(1 << 28)
+#define FUSE_PASSTHROUGH	(1 << 29)
 
 /**
  * CUSE INIT request/reply flags
@@ -625,7 +626,7 @@ struct fuse_create_in {
 struct fuse_open_out {
 	uint64_t	fh;
 	uint32_t	open_flags;
-	uint32_t	padding;
+	uint32_t	passthrough_fh;
 };
 
 struct fuse_release_in {
@@ -828,6 +829,13 @@ struct fuse_in_header {
 	uint32_t	padding;
 };
 
+struct fuse_passthrough_out {
+	uint32_t	fd;
+	/* For future implementation */
+	uint32_t	len;
+	void		*vec;
+};
+
 struct fuse_out_header {
 	uint32_t	len;
 	int32_t		error;
@@ -905,6 +913,7 @@ struct fuse_notify_retrieve_in {
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
+#define FUSE_DEV_IOC_PASSTHROUGH_OPEN	_IOW(FUSE_DEV_IOC_MAGIC, 1, struct fuse_passthrough_out)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
                   ` (2 preceding siblings ...)
  2021-01-25 15:30 ` [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough Alessio Balsini
@ 2021-01-25 15:30 ` Alessio Balsini
  2021-02-17 13:52   ` Miklos Szeredi
  2021-01-25 15:30 ` [PATCH RESEND V12 5/8] fuse: Introduce synchronous read and write for passthrough Alessio Balsini
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

Implement the FUSE passthrough ioctl that associates the lower
(passthrough) file system file with the fuse_file.

The file descriptor passed to the ioctl by the FUSE daemon is used to
access the relative file pointer, that will be copied to the fuse_file
data structure to consolidate the link between the FUSE and lower file
system.

To enable the passthrough mode, user space triggers the
FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
back an identifier that will be used at open/create response time in the
fuse_open_out field to associate the FUSE file to the lower file system
file.
The value returned by the ioctl to user space can be:
- > 0: success, the identifier can be used as part of an open/create
reply.
- <= 0: an error occurred.
The value 0 represents an error to preserve backward compatibility: the
fuse_open_out field that is used to pass the passthrough_fh back to the
kernel uses the same bits that were previously as struct padding, and is
commonly zero-initialized (e.g., in the libfuse implementation).
Removing 0 from the correct values fixes the ambiguity between the case
in which 0 corresponds to a real passthrough_fh, a missing
implementation of FUSE passthrough or a request for a normal FUSE file,
simplifying the user space implementation.

For the passthrough mode to be successfully activated, the lower file
system file must implement both read_iter and write_iter file
operations. This extra check avoids special pseudo files to be targeted
for this feature.
Passthrough comes with another limitation: no further file system
stacking is allowed for those FUSE file systems using passthrough.

Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/fuse/inode.c       |  5 +++
 fs/fuse/passthrough.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index a1104d5abb70..7ebc398fbacb 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1133,6 +1133,11 @@ EXPORT_SYMBOL_GPL(fuse_send_init);
 
 static int free_fuse_passthrough(int id, void *p, void *data)
 {
+	struct fuse_passthrough *passthrough = (struct fuse_passthrough *)p;
+
+	fuse_passthrough_release(passthrough);
+	kfree(p);
+
 	return 0;
 }
 
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 594060c654f8..cf993e83803e 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -3,19 +3,102 @@
 #include "fuse_i.h"
 
 #include <linux/fuse.h>
+#include <linux/idr.h>
 
 int fuse_passthrough_open(struct fuse_dev *fud,
 			  struct fuse_passthrough_out *pto)
 {
-	return -EINVAL;
+	int res;
+	struct file *passthrough_filp;
+	struct fuse_conn *fc = fud->fc;
+	struct inode *passthrough_inode;
+	struct super_block *passthrough_sb;
+	struct fuse_passthrough *passthrough;
+
+	if (!fc->passthrough)
+		return -EPERM;
+
+	/* This field is reserved for future implementation */
+	if (pto->len != 0)
+		return -EINVAL;
+
+	passthrough_filp = fget(pto->fd);
+	if (!passthrough_filp) {
+		pr_err("FUSE: invalid file descriptor for passthrough.\n");
+		return -EBADF;
+	}
+
+	if (!passthrough_filp->f_op->read_iter ||
+	    !passthrough_filp->f_op->write_iter) {
+		pr_err("FUSE: passthrough file misses file operations.\n");
+		res = -EBADF;
+		goto err_free_file;
+	}
+
+	passthrough_inode = file_inode(passthrough_filp);
+	passthrough_sb = passthrough_inode->i_sb;
+	if (passthrough_sb->s_stack_depth >= FILESYSTEM_MAX_STACK_DEPTH) {
+		pr_err("FUSE: fs stacking depth exceeded for passthrough\n");
+		res = -EINVAL;
+		goto err_free_file;
+	}
+
+	passthrough = kmalloc(sizeof(struct fuse_passthrough), GFP_KERNEL);
+	if (!passthrough) {
+		res = -ENOMEM;
+		goto err_free_file;
+	}
+
+	passthrough->filp = passthrough_filp;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock(&fc->passthrough_req_lock);
+	res = idr_alloc(&fc->passthrough_req, passthrough, 1, 0, GFP_ATOMIC);
+	spin_unlock(&fc->passthrough_req_lock);
+	idr_preload_end();
+
+	if (res > 0)
+		return res;
+
+	fuse_passthrough_release(passthrough);
+	kfree(passthrough);
+
+err_free_file:
+	fput(passthrough_filp);
+
+	return res;
 }
 
 int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
 			   struct fuse_open_out *openarg)
 {
-	return -EINVAL;
+	struct fuse_passthrough *passthrough;
+	int passthrough_fh = openarg->passthrough_fh;
+
+	if (!fc->passthrough)
+		return -EPERM;
+
+	/* Default case, passthrough is not requested */
+	if (passthrough_fh <= 0)
+		return -EINVAL;
+
+	spin_lock(&fc->passthrough_req_lock);
+	passthrough = idr_remove(&fc->passthrough_req, passthrough_fh);
+	spin_unlock(&fc->passthrough_req_lock);
+
+	if (!passthrough)
+		return -EINVAL;
+
+	ff->passthrough = *passthrough;
+	kfree(passthrough);
+
+	return 0;
 }
 
 void fuse_passthrough_release(struct fuse_passthrough *passthrough)
 {
+	if (passthrough->filp) {
+		fput(passthrough->filp);
+		passthrough->filp = NULL;
+	}
 }
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH RESEND V12 5/8] fuse: Introduce synchronous read and write for passthrough
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
                   ` (3 preceding siblings ...)
  2021-01-25 15:30 ` [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release Alessio Balsini
@ 2021-01-25 15:30 ` Alessio Balsini
  2021-02-17 14:00   ` Miklos Szeredi
  2021-01-25 15:30 ` [PATCH RESEND V12 6/8] fuse: Handle asynchronous read and write in passthrough Alessio Balsini
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

All the read and write operations performed on fuse_files which have the
passthrough feature enabled are forwarded to the associated lower file
system file via VFS.

Sending the request directly to the lower file system avoids the
userspace round-trip that, because of possible context switches and
additional operations might reduce the overall performance, especially
in those cases where caching doesn't help, for example in reads at
random offsets.

Verifying if a fuse_file has a lower file system file associated with
can be done by checking the validity of its passthrough_filp pointer.
This pointer is not NULL only if passthrough has been successfully
enabled via the appropriate ioctl().
When a read/write operation is requested for a FUSE file with
passthrough enabled, a new equivalent VFS request is generated, which
instead targets the lower file system file.
The VFS layer performs additional checks that allow for safer operations
but may cause the operation to fail if the process accessing the FUSE
file system does not have access to the lower file system.

This change only implements synchronous requests in passthrough,
returning an error in the case of asynchronous operations, yet covering
the majority of the use cases.

Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/fuse/file.c        |  8 ++++--
 fs/fuse/fuse_i.h      |  2 ++
 fs/fuse/passthrough.c | 57 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 953f3034c375..cddada1e8bd9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1581,7 +1581,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_read_iter(iocb, to);
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
+	if (ff->passthrough.filp)
+		return fuse_passthrough_read_iter(iocb, to);
+	else if (!(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_cache_read_iter(iocb, to);
 	else
 		return fuse_direct_read_iter(iocb, to);
@@ -1599,7 +1601,9 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (FUSE_IS_DAX(inode))
 		return fuse_dax_write_iter(iocb, from);
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
+	if (ff->passthrough.filp)
+		return fuse_passthrough_write_iter(iocb, from);
+	else if (!(ff->open_flags & FOPEN_DIRECT_IO))
 		return fuse_cache_write_iter(iocb, from);
 	else
 		return fuse_direct_write_iter(iocb, from);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 8d39f5304a11..c4730d893324 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1239,5 +1239,7 @@ int fuse_passthrough_open(struct fuse_dev *fud,
 int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
 			   struct fuse_open_out *openarg);
 void fuse_passthrough_release(struct fuse_passthrough *passthrough);
+ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index cf993e83803e..d949ca07a83b 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -4,6 +4,63 @@
 
 #include <linux/fuse.h>
 #include <linux/idr.h>
+#include <linux/uio.h>
+
+#define PASSTHROUGH_IOCB_MASK                                                  \
+	(IOCB_APPEND | IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
+
+static void fuse_copyattr(struct file *dst_file, struct file *src_file)
+{
+	struct inode *dst = file_inode(dst_file);
+	struct inode *src = file_inode(src_file);
+
+	i_size_write(dst, i_size_read(src));
+}
+
+ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
+				   struct iov_iter *iter)
+{
+	ssize_t ret;
+	struct file *fuse_filp = iocb_fuse->ki_filp;
+	struct fuse_file *ff = fuse_filp->private_data;
+	struct file *passthrough_filp = ff->passthrough.filp;
+
+	if (!iov_iter_count(iter))
+		return 0;
+
+	ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
+			    iocb_to_rw_flags(iocb_fuse->ki_flags,
+					     PASSTHROUGH_IOCB_MASK));
+
+	return ret;
+}
+
+ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
+				    struct iov_iter *iter)
+{
+	ssize_t ret;
+	struct file *fuse_filp = iocb_fuse->ki_filp;
+	struct fuse_file *ff = fuse_filp->private_data;
+	struct inode *fuse_inode = file_inode(fuse_filp);
+	struct file *passthrough_filp = ff->passthrough.filp;
+
+	if (!iov_iter_count(iter))
+		return 0;
+
+	inode_lock(fuse_inode);
+
+	file_start_write(passthrough_filp);
+	ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
+			     iocb_to_rw_flags(iocb_fuse->ki_flags,
+					      PASSTHROUGH_IOCB_MASK));
+	file_end_write(passthrough_filp);
+	if (ret > 0)
+		fuse_copyattr(fuse_filp, passthrough_filp);
+
+	inode_unlock(fuse_inode);
+
+	return ret;
+}
 
 int fuse_passthrough_open(struct fuse_dev *fud,
 			  struct fuse_passthrough_out *pto)
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH RESEND V12 6/8] fuse: Handle asynchronous read and write in passthrough
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
                   ` (4 preceding siblings ...)
  2021-01-25 15:30 ` [PATCH RESEND V12 5/8] fuse: Introduce synchronous read and write for passthrough Alessio Balsini
@ 2021-01-25 15:30 ` Alessio Balsini
  2021-01-25 15:30 ` [PATCH RESEND V12 7/8] fuse: Use daemon creds in passthrough mode Alessio Balsini
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

Extend the passthrough feature by handling asynchronous IO both for read
and write operations.

When an AIO request is received, if the request targets a FUSE file with
the passthrough functionality enabled, a new identical AIO request is
created. The new request targets the lower file system file and gets
assigned a special FUSE passthrough AIO completion callback.
When the lower file system AIO request is completed, the FUSE
passthrough AIO completion callback is executed and propagates the
completion signal to the FUSE AIO request by triggering its completion
callback as well.

Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/fuse/passthrough.c | 89 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index d949ca07a83b..c7fa1eeb7639 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -9,6 +9,11 @@
 #define PASSTHROUGH_IOCB_MASK                                                  \
 	(IOCB_APPEND | IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
 
+struct fuse_aio_req {
+	struct kiocb iocb;
+	struct kiocb *iocb_fuse;
+};
+
 static void fuse_copyattr(struct file *dst_file, struct file *src_file)
 {
 	struct inode *dst = file_inode(dst_file);
@@ -17,6 +22,32 @@ static void fuse_copyattr(struct file *dst_file, struct file *src_file)
 	i_size_write(dst, i_size_read(src));
 }
 
+static void fuse_aio_cleanup_handler(struct fuse_aio_req *aio_req)
+{
+	struct kiocb *iocb = &aio_req->iocb;
+	struct kiocb *iocb_fuse = aio_req->iocb_fuse;
+
+	if (iocb->ki_flags & IOCB_WRITE) {
+		__sb_writers_acquired(file_inode(iocb->ki_filp)->i_sb,
+				      SB_FREEZE_WRITE);
+		file_end_write(iocb->ki_filp);
+		fuse_copyattr(iocb_fuse->ki_filp, iocb->ki_filp);
+	}
+
+	iocb_fuse->ki_pos = iocb->ki_pos;
+	kfree(aio_req);
+}
+
+static void fuse_aio_rw_complete(struct kiocb *iocb, long res, long res2)
+{
+	struct fuse_aio_req *aio_req =
+		container_of(iocb, struct fuse_aio_req, iocb);
+	struct kiocb *iocb_fuse = aio_req->iocb_fuse;
+
+	fuse_aio_cleanup_handler(aio_req);
+	iocb_fuse->ki_complete(iocb_fuse, res, res2);
+}
+
 ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
 				   struct iov_iter *iter)
 {
@@ -28,9 +59,24 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
 	if (!iov_iter_count(iter))
 		return 0;
 
-	ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
-			    iocb_to_rw_flags(iocb_fuse->ki_flags,
-					     PASSTHROUGH_IOCB_MASK));
+	if (is_sync_kiocb(iocb_fuse)) {
+		ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
+				    iocb_to_rw_flags(iocb_fuse->ki_flags,
+						     PASSTHROUGH_IOCB_MASK));
+	} else {
+		struct fuse_aio_req *aio_req;
+
+		aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
+		if (!aio_req)
+			return -ENOMEM;
+
+		aio_req->iocb_fuse = iocb_fuse;
+		kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
+		aio_req->iocb.ki_complete = fuse_aio_rw_complete;
+		ret = call_read_iter(passthrough_filp, &aio_req->iocb, iter);
+		if (ret != -EIOCBQUEUED)
+			fuse_aio_cleanup_handler(aio_req);
+	}
 
 	return ret;
 }
@@ -43,20 +89,41 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
 	struct fuse_file *ff = fuse_filp->private_data;
 	struct inode *fuse_inode = file_inode(fuse_filp);
 	struct file *passthrough_filp = ff->passthrough.filp;
+	struct inode *passthrough_inode = file_inode(passthrough_filp);
 
 	if (!iov_iter_count(iter))
 		return 0;
 
 	inode_lock(fuse_inode);
 
-	file_start_write(passthrough_filp);
-	ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
-			     iocb_to_rw_flags(iocb_fuse->ki_flags,
-					      PASSTHROUGH_IOCB_MASK));
-	file_end_write(passthrough_filp);
-	if (ret > 0)
-		fuse_copyattr(fuse_filp, passthrough_filp);
-
+	if (is_sync_kiocb(iocb_fuse)) {
+		file_start_write(passthrough_filp);
+		ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
+				     iocb_to_rw_flags(iocb_fuse->ki_flags,
+						      PASSTHROUGH_IOCB_MASK));
+		file_end_write(passthrough_filp);
+		if (ret > 0)
+			fuse_copyattr(fuse_filp, passthrough_filp);
+	} else {
+		struct fuse_aio_req *aio_req;
+
+		aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
+		if (!aio_req) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		file_start_write(passthrough_filp);
+		__sb_writers_release(passthrough_inode->i_sb, SB_FREEZE_WRITE);
+
+		aio_req->iocb_fuse = iocb_fuse;
+		kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
+		aio_req->iocb.ki_complete = fuse_aio_rw_complete;
+		ret = call_write_iter(passthrough_filp, &aio_req->iocb, iter);
+		if (ret != -EIOCBQUEUED)
+			fuse_aio_cleanup_handler(aio_req);
+	}
+out:
 	inode_unlock(fuse_inode);
 
 	return ret;
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH RESEND V12 7/8] fuse: Use daemon creds in passthrough mode
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
                   ` (5 preceding siblings ...)
  2021-01-25 15:30 ` [PATCH RESEND V12 6/8] fuse: Handle asynchronous read and write in passthrough Alessio Balsini
@ 2021-01-25 15:30 ` Alessio Balsini
  2021-02-05  9:23   ` Peng Tao
  2021-01-25 15:30 ` [PATCH RESEND V12 8/8] fuse: Introduce passthrough for mmap Alessio Balsini
  2021-11-18 18:31 ` [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Amir Goldstein
  8 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

When using FUSE passthrough, read/write operations are directly
forwarded to the lower file system file through VFS, but there is no
guarantee that the process that is triggering the request has the right
permissions to access the lower file system. This would cause the
read/write access to fail.

In passthrough file systems, where the FUSE daemon is responsible for
the enforcement of the lower file system access policies, often happens
that the process dealing with the FUSE file system doesn't have access
to the lower file system.
Being the FUSE daemon in charge of implementing the FUSE file
operations, that in the case of read/write operations usually simply
results in the copy of memory buffers from/to the lower file system
respectively, these operations are executed with the FUSE daemon
privileges.

This patch adds a reference to the FUSE daemon credentials, referenced
at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
to temporarily raise the user credentials when accessing lower file
system files in passthrough.
The process accessing the FUSE file with passthrough enabled temporarily
receives the privileges of the FUSE daemon while performing read/write
operations. Similar behavior is implemented in overlayfs.
These privileges will be reverted as soon as the IO operation completes.
This feature does not provide any higher security privileges to those
processes accessing the FUSE file system with passthrough enabled. This
is because it is still the FUSE daemon responsible for enabling or not
the passthrough feature at file open time, and should enable the feature
only after appropriate access policy checks.

Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/fuse/fuse_i.h      |  5 ++++-
 fs/fuse/passthrough.c | 11 +++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4730d893324..815af1845b16 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -182,10 +182,13 @@ struct fuse_release_args;
 
 /**
  * Reference to lower filesystem file for read/write operations handled in
- * passthrough mode
+ * passthrough mode.
+ * This struct also tracks the credentials to be used for handling read/write
+ * operations.
  */
 struct fuse_passthrough {
 	struct file *filp;
+	struct cred *cred;
 };
 
 /** FUSE specific file data */
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index c7fa1eeb7639..24866c5fe7e2 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -52,6 +52,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
 				   struct iov_iter *iter)
 {
 	ssize_t ret;
+	const struct cred *old_cred;
 	struct file *fuse_filp = iocb_fuse->ki_filp;
 	struct fuse_file *ff = fuse_filp->private_data;
 	struct file *passthrough_filp = ff->passthrough.filp;
@@ -59,6 +60,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
 	if (!iov_iter_count(iter))
 		return 0;
 
+	old_cred = override_creds(ff->passthrough.cred);
 	if (is_sync_kiocb(iocb_fuse)) {
 		ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
 				    iocb_to_rw_flags(iocb_fuse->ki_flags,
@@ -77,6 +79,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
 		if (ret != -EIOCBQUEUED)
 			fuse_aio_cleanup_handler(aio_req);
 	}
+	revert_creds(old_cred);
 
 	return ret;
 }
@@ -85,6 +88,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
 				    struct iov_iter *iter)
 {
 	ssize_t ret;
+	const struct cred *old_cred;
 	struct file *fuse_filp = iocb_fuse->ki_filp;
 	struct fuse_file *ff = fuse_filp->private_data;
 	struct inode *fuse_inode = file_inode(fuse_filp);
@@ -96,6 +100,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
 
 	inode_lock(fuse_inode);
 
+	old_cred = override_creds(ff->passthrough.cred);
 	if (is_sync_kiocb(iocb_fuse)) {
 		file_start_write(passthrough_filp);
 		ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
@@ -124,6 +129,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
 			fuse_aio_cleanup_handler(aio_req);
 	}
 out:
+	revert_creds(old_cred);
 	inode_unlock(fuse_inode);
 
 	return ret;
@@ -174,6 +180,7 @@ int fuse_passthrough_open(struct fuse_dev *fud,
 	}
 
 	passthrough->filp = passthrough_filp;
+	passthrough->cred = prepare_creds();
 
 	idr_preload(GFP_KERNEL);
 	spin_lock(&fc->passthrough_req_lock);
@@ -225,4 +232,8 @@ void fuse_passthrough_release(struct fuse_passthrough *passthrough)
 		fput(passthrough->filp);
 		passthrough->filp = NULL;
 	}
+	if (passthrough->cred) {
+		put_cred(passthrough->cred);
+		passthrough->cred = NULL;
+	}
 }
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH RESEND V12 8/8] fuse: Introduce passthrough for mmap
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
                   ` (6 preceding siblings ...)
  2021-01-25 15:30 ` [PATCH RESEND V12 7/8] fuse: Use daemon creds in passthrough mode Alessio Balsini
@ 2021-01-25 15:30 ` Alessio Balsini
  2021-02-17 14:05   ` Miklos Szeredi
  2021-11-18 18:31 ` [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Amir Goldstein
  8 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 15:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

Enabling FUSE passthrough for mmap-ed operations not only affects
performance, but has also been shown as mandatory for the correct
functioning of FUSE passthrough.
yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
data inconsistencies if the same file is also accessed with mmap. What
happens is that read/write operations are directly applied to the lower
file system (and its cache), while mmap-ed operations are affecting the
FUSE cache.

Extend the FUSE passthrough implementation to also handle memory-mapped
FUSE file, to both fix the cache inconsistencies and extend the
passthrough performance benefits to mmap-ed operations.

[1] https://lore.kernel.org/lkml/20210119110654.11817-1-wu-yan@tcl.com/

Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/fuse/file.c        |  3 +++
 fs/fuse/fuse_i.h      |  1 +
 fs/fuse/passthrough.c | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cddada1e8bd9..e3741a94c1f9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2370,6 +2370,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (FUSE_IS_DAX(file_inode(file)))
 		return fuse_dax_mmap(file, vma);
 
+	if (ff->passthrough.filp)
+		return fuse_passthrough_mmap(file, vma);
+
 	if (ff->open_flags & FOPEN_DIRECT_IO) {
 		/* Can't provide the coherency needed for MAP_SHARED */
 		if (vma->vm_flags & VM_MAYSHARE)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 815af1845b16..7b0d65984608 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1244,5 +1244,6 @@ int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
 void fuse_passthrough_release(struct fuse_passthrough *passthrough);
 ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma);
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 24866c5fe7e2..284979f87747 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -135,6 +135,47 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
 	return ret;
 }
 
+ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	const struct cred *old_cred;
+	struct fuse_file *ff = file->private_data;
+	struct inode *fuse_inode = file_inode(file);
+	struct file *passthrough_filp = ff->passthrough.filp;
+	struct inode *passthrough_inode = file_inode(passthrough_filp);
+
+	if (!passthrough_filp->f_op->mmap)
+		return -ENODEV;
+
+	if (WARN_ON(file != vma->vm_file))
+		return -EIO;
+
+	vma->vm_file = get_file(passthrough_filp);
+
+	old_cred = override_creds(ff->passthrough.cred);
+	ret = call_mmap(vma->vm_file, vma);
+	revert_creds(old_cred);
+
+	if (ret)
+		fput(passthrough_filp);
+	else
+		fput(file);
+
+	if (file->f_flags & O_NOATIME)
+		return ret;
+
+	if ((!timespec64_equal(&fuse_inode->i_mtime,
+			       &passthrough_inode->i_mtime) ||
+	     !timespec64_equal(&fuse_inode->i_ctime,
+			       &passthrough_inode->i_ctime))) {
+		fuse_inode->i_mtime = passthrough_inode->i_mtime;
+		fuse_inode->i_ctime = passthrough_inode->i_ctime;
+	}
+	touch_atime(&file->f_path);
+
+	return ret;
+}
+
 int fuse_passthrough_open(struct fuse_dev *fud,
 			  struct fuse_passthrough_out *pto)
 {
-- 
2.30.0.280.ga3ce27912f-goog


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags
  2021-01-25 15:30 ` [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags Alessio Balsini
@ 2021-01-25 16:46   ` Alessio Balsini
  2021-03-24  7:43     ` Rokudo Yan
  0 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-01-25 16:46 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

On Mon, Jan 25, 2021 at 03:30:50PM +0000, Alessio Balsini wrote:
> OverlayFS implements its own function to translate iocb flags into rw
> flags, so that they can be passed into another vfs call.
> With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
> Jens created a 1:1 matching between the iocb flags and rw flags,
> simplifying the conversion.
> 
> Reduce the OverlayFS code by making the flag conversion function generic
> and reusable.
> 
> Signed-off-by: Alessio Balsini <balsini@android.com>
> ---
>  fs/overlayfs/file.c | 23 +++++------------------
>  include/linux/fs.h  |  5 +++++
>  2 files changed, 10 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index bd9dd38347ae..56be2ffc5a14 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -15,6 +15,8 @@
>  #include <linux/fs.h>
>  #include "overlayfs.h"
>  
> +#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
> +
>  struct ovl_aio_req {
>  	struct kiocb iocb;
>  	struct kiocb *orig_iocb;
> @@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
>  	touch_atime(&file->f_path);
>  }
>  
> -static rwf_t ovl_iocb_to_rwf(int ifl)
> -{
> -	rwf_t flags = 0;
> -
> -	if (ifl & IOCB_NOWAIT)
> -		flags |= RWF_NOWAIT;
> -	if (ifl & IOCB_HIPRI)
> -		flags |= RWF_HIPRI;
> -	if (ifl & IOCB_DSYNC)
> -		flags |= RWF_DSYNC;
> -	if (ifl & IOCB_SYNC)
> -		flags |= RWF_SYNC;
> -
> -	return flags;
> -}
> -
>  static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
>  {
>  	struct kiocb *iocb = &aio_req->iocb;
> @@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	old_cred = ovl_override_creds(file_inode(file)->i_sb);
>  	if (is_sync_kiocb(iocb)) {
>  		ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
> -				    ovl_iocb_to_rwf(iocb->ki_flags));
> +				    iocb_to_rw_flags(iocb->ki_flags,
> +						     OVL_IOCB_MASK));
>  	} else {
>  		struct ovl_aio_req *aio_req;
>  
> @@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	if (is_sync_kiocb(iocb)) {
>  		file_start_write(real.file);
>  		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> -				     ovl_iocb_to_rwf(ifl));
> +				     iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
>  		file_end_write(real.file);
>  		/* Update size */
>  		ovl_copyattr(ovl_inode_real(inode), inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd47deea7c17..647c35423545 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>  	return 0;
>  }
>  
> +static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
> +{
> +	return ifl & iocb_mask;
> +}
> +
>  static inline ino_t parent_ino(struct dentry *dentry)
>  {
>  	ino_t res;
> -- 
> 2.30.0.280.ga3ce27912f-goog
> 

For some reason lkml.org and lore.kernel.org are not showing this change
as part of the thread.
Let's see if replying to the email fixes the indexing.

Regards,
Alessio

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
       [not found]   ` <CAMAHBGzkfEd9-1u0iKXp65ReJQgUi_=4sMpmfkwEOaMp6Ux7pg@mail.gmail.com>
@ 2021-01-27 13:40     ` Alessio Balsini
       [not found]       ` <CAMAHBGwpKW+30kNQ_Apt8A-FTmr94hBOzkT21cjEHHW+t7yUMQ@mail.gmail.com>
  0 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-01-27 13:40 UTC (permalink / raw)
  To: qxy
  Cc: Alessio Balsini, Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, kernel-team, linux-fsdevel, linux-kernel

On Wed, Jan 27, 2021 at 08:15:19PM +0800, qxy wrote:
> Hi Allesio,
> 
> Thank you for your supply for 32-bit user space.
> Actually we have already met this issue on our product and we resolved it
> like this:
> 
> Project *platform/external/libfuse*
> diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
> index e9d4f1a..fe0cb6d 100644
> --- a/include/fuse_kernel.h
> +++ b/include/fuse_kernel.h
> @@ -562,7 +562,11 @@
>          uint32_t fd;
>          /* For future implementation */
>          uint32_t len;
> -        void *   vec;
> +        union {
> +                void *   vec;
> +                /* compatible for 32-bit libfuse and 64-bit kernel */
> +                uint64_t padding;
> +        };
>  };
> 
> However, if we need to use *vec in the future,  we still need to do more in
> 32-bit libfuse to work with 64-bit kernel :(
> 

Good point.
What I had in mind as a possible use was the possibility of enabling
passthrough for only a few regions of the file, similar to fuse2.
But it doesn't really make sense to define the data structure fields for
uncertain future extensions, so what we could do is:

struct fuse_passthrough_out {
+	uint32_t	size;	// Size of this data structure
	uint32_t	fd;
-	/* For future implementation */
-	uint32_t	len;
-	void		*vec;
};

Similar to what "struct sched_attr" does.
This is probably the most flexible solution, that would allow to extend
this data structure in the future with no headaches both in kernel and
user space.
What do you think?

Thanks!
Alessio


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
       [not found]       ` <CAMAHBGwpKW+30kNQ_Apt8A-FTmr94hBOzkT21cjEHHW+t7yUMQ@mail.gmail.com>
@ 2021-01-28 14:15         ` Alessio Balsini
  2021-02-05  9:54           ` Peng Tao
  2021-03-16 18:57           ` Arnd Bergmann
  0 siblings, 2 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-01-28 14:15 UTC (permalink / raw)
  To: qxy
  Cc: Alessio Balsini, Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, kernel-team, linux-fsdevel, linux-kernel

Hi all,

I'm more than happy to change the interface into something that is
objectively better and accepted by everyone.
I would really love to reach the point at which we have a "stable-ish"
UAPI as soon as possible.

I've been thinking about a few possible approaches to fix the issue, yet
to preserve its flexibility. These are mentioned below.


  Solution 1: Size

As mentioned in my previous email, one solution could be to introduce
the "size" field to allow the structure to grow in the future.

struct fuse_passthrough_out {
    uint32_t        size;   // Size of this data structure
    uint32_t        fd;
};

The problem here is that we are making the promise that all the upcoming
fields are going to be maintained forever and at the offsets they were
originally defined.


  Solution 2: Version

Another solution could be to s/size/version, where for every version of
FUSE passthrough we reserve the right to modifying the fields over time,
casting them to the right data structure according to the version.


  Solution 3: Type

Using an enumerator to define the data structure content and purpose is
the most flexible solution I can think of.  This would for example allow
us to substitute FUSE_DEV_IOC_PASSTHROUGH_OPEN with the generic
FUSE_DEV_IOC_PASSTHROUGH and having a single ioctl for any eventually
upcoming passthrough requests.

enum fuse_passthrough_type {
    FUSE_PASSTHROUGH_OPEN
};

struct fuse_passthrough_out {
    uint32_t type; /* as defined by enum fuse_passthrough_type */
    union {
        uint32_t fd;
    };
};

This last is my favorite, as regardless the minimal logic required to
detect the size and content of the struct (not required now as we only
have a single option), it would also allow to do some kind of interface
versioning (e.g., in case we want to implement
FUSE_PASSTHROUGH_OPEN_V2).

What do you think?

Thanks,
Alessio

P.S.
Sorry if you received a duplicate email. I first sent this in reply to an email
without realizing it was a private message.

On Thu, Jan 28, 2021 at 11:01:59AM +0800, qxy wrote:
> Hi Alessio,
> 
> I have received a failure from the Mail Delivery System for times and feel
> really sorry if you have already received the duplicate message...
> 
> Thank you for your reply.
> I think it's wonderful to remove *vec from the data structure fields since
> we consider that it is not a good idea to use pointer when there is a need
> for cross-platform.
> Do you have a plan to modify the kernel fuse_passthrough_out data structure
> the same way as you mentioned?
> 
> Thanks!
> qixiaoyu

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 7/8] fuse: Use daemon creds in passthrough mode
  2021-01-25 15:30 ` [PATCH RESEND V12 7/8] fuse: Use daemon creds in passthrough mode Alessio Balsini
@ 2021-02-05  9:23   ` Peng Tao
  2021-02-05 11:21     ` Alessio Balsini
  0 siblings, 1 reply; 66+ messages in thread
From: Peng Tao @ 2021-02-05  9:23 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel, Linux Kernel Mailing List

On Mon, Jan 25, 2021 at 11:31 PM Alessio Balsini <balsini@android.com> wrote:
>
> When using FUSE passthrough, read/write operations are directly
> forwarded to the lower file system file through VFS, but there is no
> guarantee that the process that is triggering the request has the right
> permissions to access the lower file system. This would cause the
> read/write access to fail.
>
> In passthrough file systems, where the FUSE daemon is responsible for
> the enforcement of the lower file system access policies, often happens
> that the process dealing with the FUSE file system doesn't have access
> to the lower file system.
> Being the FUSE daemon in charge of implementing the FUSE file
> operations, that in the case of read/write operations usually simply
> results in the copy of memory buffers from/to the lower file system
> respectively, these operations are executed with the FUSE daemon
> privileges.
>
> This patch adds a reference to the FUSE daemon credentials, referenced
> at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
> to temporarily raise the user credentials when accessing lower file
> system files in passthrough.
> The process accessing the FUSE file with passthrough enabled temporarily
> receives the privileges of the FUSE daemon while performing read/write
> operations. Similar behavior is implemented in overlayfs.
> These privileges will be reverted as soon as the IO operation completes.
> This feature does not provide any higher security privileges to those
> processes accessing the FUSE file system with passthrough enabled. This
> is because it is still the FUSE daemon responsible for enabling or not
> the passthrough feature at file open time, and should enable the feature
> only after appropriate access policy checks.
>
> Signed-off-by: Alessio Balsini <balsini@android.com>
> ---
>  fs/fuse/fuse_i.h      |  5 ++++-
>  fs/fuse/passthrough.c | 11 +++++++++++
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index c4730d893324..815af1845b16 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -182,10 +182,13 @@ struct fuse_release_args;
>
>  /**
>   * Reference to lower filesystem file for read/write operations handled in
> - * passthrough mode
> + * passthrough mode.
> + * This struct also tracks the credentials to be used for handling read/write
> + * operations.
>   */
>  struct fuse_passthrough {
>         struct file *filp;
> +       struct cred *cred;
>  };
>
>  /** FUSE specific file data */
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index c7fa1eeb7639..24866c5fe7e2 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -52,6 +52,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
>                                    struct iov_iter *iter)
>  {
>         ssize_t ret;
> +       const struct cred *old_cred;
>         struct file *fuse_filp = iocb_fuse->ki_filp;
>         struct fuse_file *ff = fuse_filp->private_data;
>         struct file *passthrough_filp = ff->passthrough.filp;
> @@ -59,6 +60,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
>         if (!iov_iter_count(iter))
>                 return 0;
>
> +       old_cred = override_creds(ff->passthrough.cred);
>         if (is_sync_kiocb(iocb_fuse)) {
>                 ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
>                                     iocb_to_rw_flags(iocb_fuse->ki_flags,
> @@ -77,6 +79,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
>                 if (ret != -EIOCBQUEUED)
>                         fuse_aio_cleanup_handler(aio_req);
>         }
> +       revert_creds(old_cred);
cred should be reverted when kmalloc() fails above.

Cheers,
Tao
-- 
Into Sth. Rich & Strange

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-01-28 14:15         ` Alessio Balsini
@ 2021-02-05  9:54           ` Peng Tao
  2021-03-16 18:57           ` Arnd Bergmann
  1 sibling, 0 replies; 66+ messages in thread
From: Peng Tao @ 2021-02-05  9:54 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: qxy, Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel, Linux Kernel Mailing List

On Thu, Jan 28, 2021 at 10:15 PM Alessio Balsini <balsini@android.com> wrote:
>
> Hi all,
>
> I'm more than happy to change the interface into something that is
> objectively better and accepted by everyone.
> I would really love to reach the point at which we have a "stable-ish"
> UAPI as soon as possible.
>
> I've been thinking about a few possible approaches to fix the issue, yet
> to preserve its flexibility. These are mentioned below.
>
>
>   Solution 1: Size
>
> As mentioned in my previous email, one solution could be to introduce
> the "size" field to allow the structure to grow in the future.
>
> struct fuse_passthrough_out {
>     uint32_t        size;   // Size of this data structure
>     uint32_t        fd;
> };
>
> The problem here is that we are making the promise that all the upcoming
> fields are going to be maintained forever and at the offsets they were
> originally defined.
>
>
>   Solution 2: Version
>
> Another solution could be to s/size/version, where for every version of
> FUSE passthrough we reserve the right to modifying the fields over time,
> casting them to the right data structure according to the version.
>
>
>   Solution 3: Type
>
> Using an enumerator to define the data structure content and purpose is
> the most flexible solution I can think of.  This would for example allow
> us to substitute FUSE_DEV_IOC_PASSTHROUGH_OPEN with the generic
> FUSE_DEV_IOC_PASSTHROUGH and having a single ioctl for any eventually
> upcoming passthrough requests.
>
> enum fuse_passthrough_type {
>     FUSE_PASSTHROUGH_OPEN
> };
>
> struct fuse_passthrough_out {
>     uint32_t type; /* as defined by enum fuse_passthrough_type */
>     union {
>         uint32_t fd;
>     };
> };
>
> This last is my favorite, as regardless the minimal logic required to
> detect the size and content of the struct (not required now as we only
> have a single option), it would also allow to do some kind of interface
> versioning (e.g., in case we want to implement
> FUSE_PASSTHROUGH_OPEN_V2).
>
Usually a new type of ioctl will be added in such cases. If we want to
stick to the same ioctl number, it might be easier to add a `flags`
field to differentiate compatible passthrough ioctls. So in future, if
a new interface is compatible with the existing one, we can use flags
to tell it. If it is not compatible, we still need to add a new ioctl.
wdyt?

struct fuse_passthrough_out {
    uint32_t flags;
    union {
        uint32_t fd;
    };
};

This somehow follows the "Flags as a system call API design pattern"
(https://lwn.net/Articles/585415/).

Just my two cents.

Cheers,
Tao
-- 
Into Sth. Rich & Strange

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 7/8] fuse: Use daemon creds in passthrough mode
  2021-02-05  9:23   ` Peng Tao
@ 2021-02-05 11:21     ` Alessio Balsini
  0 siblings, 0 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-02-05 11:21 UTC (permalink / raw)
  To: Peng Tao
  Cc: Alessio Balsini, Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel, Linux Kernel Mailing List

On Fri, Feb 05, 2021 at 05:23:56PM +0800, Peng Tao wrote:
> On Mon, Jan 25, 2021 at 11:31 PM Alessio Balsini <balsini@android.com> wrote:
> >
> > When using FUSE passthrough, read/write operations are directly
> > forwarded to the lower file system file through VFS, but there is no
> > guarantee that the process that is triggering the request has the right
> > permissions to access the lower file system. This would cause the
> > read/write access to fail.
> >
> > In passthrough file systems, where the FUSE daemon is responsible for
> > the enforcement of the lower file system access policies, often happens
> > that the process dealing with the FUSE file system doesn't have access
> > to the lower file system.
> > Being the FUSE daemon in charge of implementing the FUSE file
> > operations, that in the case of read/write operations usually simply
> > results in the copy of memory buffers from/to the lower file system
> > respectively, these operations are executed with the FUSE daemon
> > privileges.
> >
> > This patch adds a reference to the FUSE daemon credentials, referenced
> > at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
> > to temporarily raise the user credentials when accessing lower file
> > system files in passthrough.
> > The process accessing the FUSE file with passthrough enabled temporarily
> > receives the privileges of the FUSE daemon while performing read/write
> > operations. Similar behavior is implemented in overlayfs.
> > These privileges will be reverted as soon as the IO operation completes.
> > This feature does not provide any higher security privileges to those
> > processes accessing the FUSE file system with passthrough enabled. This
> > is because it is still the FUSE daemon responsible for enabling or not
> > the passthrough feature at file open time, and should enable the feature
> > only after appropriate access policy checks.
> >
> > Signed-off-by: Alessio Balsini <balsini@android.com>
> > ---
> >  fs/fuse/fuse_i.h      |  5 ++++-
> >  fs/fuse/passthrough.c | 11 +++++++++++
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index c4730d893324..815af1845b16 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -182,10 +182,13 @@ struct fuse_release_args;
> >
> >  /**
> >   * Reference to lower filesystem file for read/write operations handled in
> > - * passthrough mode
> > + * passthrough mode.
> > + * This struct also tracks the credentials to be used for handling read/write
> > + * operations.
> >   */
> >  struct fuse_passthrough {
> >         struct file *filp;
> > +       struct cred *cred;
> >  };
> >
> >  /** FUSE specific file data */
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index c7fa1eeb7639..24866c5fe7e2 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -52,6 +52,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> >                                    struct iov_iter *iter)
> >  {
> >         ssize_t ret;
> > +       const struct cred *old_cred;
> >         struct file *fuse_filp = iocb_fuse->ki_filp;
> >         struct fuse_file *ff = fuse_filp->private_data;
> >         struct file *passthrough_filp = ff->passthrough.filp;
> > @@ -59,6 +60,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> >         if (!iov_iter_count(iter))
> >                 return 0;
> >
> > +       old_cred = override_creds(ff->passthrough.cred);
> >         if (is_sync_kiocb(iocb_fuse)) {
> >                 ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
> >                                     iocb_to_rw_flags(iocb_fuse->ki_flags,
> > @@ -77,6 +79,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> >                 if (ret != -EIOCBQUEUED)
> >                         fuse_aio_cleanup_handler(aio_req);
> >         }
> > +       revert_creds(old_cred);
> cred should be reverted when kmalloc() fails above.
> 
> Cheers,
> Tao
> -- 
> Into Sth. Rich & Strange

Thanks Tao, definitely!

Please find the fixup at the bottom of this email.
I keep the WIP V13 here:

  https://github.com/balsini/linux/tree/fuse-passthrough-v13-v5.11-rc5

Thanks,
Alessio

---8<---
From 63797a2cc6b3946bce59989adcb8f39f70f27643 Mon Sep 17 00:00:00 2001
From: Alessio Balsini <balsini@android.com>
Date: Fri, 5 Feb 2021 10:58:49 +0000
Subject: [PATCH] fuse: Fix crediantials leak in passthrough read_iter

If the system doesn't have enough memory when fuse_passthrough_read_iter
is requested in asynchronous IO, an error is directly returned without
restoring the caller's credentials.
Fix by always ensuring credentials are restored.

Fixes: 20210125153057.3623715-8-balsini@android.com ("fuse: Use daemon creds in passthrough mode")
Signed-off-by: Alessio Balsini <balsini@android.com>
---
 fs/fuse/passthrough.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 284979f87747..1df94c1d8a00 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -69,8 +69,10 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
                struct fuse_aio_req *aio_req;

                aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
-               if (!aio_req)
-                       return -ENOMEM;
+               if (!aio_req) {
+                       ret = -ENOMEM;
+                       goto out;
+               }

                aio_req->iocb_fuse = iocb_fuse;
                kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
@@ -79,6 +81,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
                if (ret != -EIOCBQUEUED)
                        fuse_aio_cleanup_handler(aio_req);
        }
+out:
        revert_creds(old_cred);

        return ret;
--
2.30.0.365.g02bc693789-goog

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-01-25 15:30 ` [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device Alessio Balsini
       [not found]   ` <CAMAHBGzkfEd9-1u0iKXp65ReJQgUi_=4sMpmfkwEOaMp6Ux7pg@mail.gmail.com>
@ 2021-02-17 10:21   ` Miklos Szeredi
  2021-03-01 12:26     ` Alessio Balsini
  2021-03-16 18:53   ` Arnd Bergmann
  2 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2021-02-17 10:21 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
>
> With a 64-bit kernel build the FUSE device cannot handle ioctl requests
> coming from 32-bit user space.
> This is due to the ioctl command translation that generates different
> command identifiers that thus cannot be used for direct comparisons
> without proper manipulation.
>
> Explicitly extract type and number from the ioctl command to enable
> 32-bit user space compatibility on 64-bit kernel builds.
>
> Signed-off-by: Alessio Balsini <balsini@android.com>
> ---
>  fs/fuse/dev.c             | 29 ++++++++++++++++++-----------
>  include/uapi/linux/fuse.h |  3 ++-
>  2 files changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 588f8d1240aa..ff9f3b83f879 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2233,37 +2233,44 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
>  static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
>                            unsigned long arg)
>  {
> -       int err = -ENOTTY;
> +       int res;
> +       int oldfd;
> +       struct fuse_dev *fud = NULL;
>
> -       if (cmd == FUSE_DEV_IOC_CLONE) {
> -               int oldfd;
> +       if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
> +               return -EINVAL;

Why change ENOTTY to EINVAL?

Thanks,
MIklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2021-01-25 15:30 ` [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough Alessio Balsini
@ 2021-02-17 13:41   ` Miklos Szeredi
  2021-02-19  7:05     ` Peng Tao
  0 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2021-02-17 13:41 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
>
> Expose the FUSE_PASSTHROUGH interface to user space and declare all the
> basic data structures and functions as the skeleton on top of which the
> FUSE passthrough functionality will be built.
>
> As part of this, introduce the new FUSE passthrough ioctl, which allows
> the FUSE daemon to specify a direct connection between a FUSE file and a
> lower file system file. Such ioctl requires user space to pass the file
> descriptor of one of its opened files through the fuse_passthrough_out
> data structure introduced in this patch. This structure includes extra
> fields for possible future extensions.
> Also, add the passthrough functions for the set-up and tear-down of the
> data structures and locks that will be used both when fuse_conns and
> fuse_files are created/deleted.
>
> Signed-off-by: Alessio Balsini <balsini@android.com>

[...]

> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 54442612c48b..9d7685ce0acd 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -360,6 +360,7 @@ struct fuse_file_lock {
>  #define FUSE_MAP_ALIGNMENT     (1 << 26)
>  #define FUSE_SUBMOUNTS         (1 << 27)
>  #define FUSE_HANDLE_KILLPRIV_V2        (1 << 28)
> +#define FUSE_PASSTHROUGH       (1 << 29)

This header has a version and a changelog.  Please update those as well.

>
>  /**
>   * CUSE INIT request/reply flags
> @@ -625,7 +626,7 @@ struct fuse_create_in {
>  struct fuse_open_out {
>         uint64_t        fh;
>         uint32_t        open_flags;
> -       uint32_t        padding;
> +       uint32_t        passthrough_fh;

I think it would be cleaner to add a FOPEN_PASSTHROUGH flag to
explicitly request passthrough instead of just passing a non-null
value to passthrough_fh.

>  };
>
>  struct fuse_release_in {
> @@ -828,6 +829,13 @@ struct fuse_in_header {
>         uint32_t        padding;
>  };
>
> +struct fuse_passthrough_out {
> +       uint32_t        fd;
> +       /* For future implementation */
> +       uint32_t        len;
> +       void            *vec;
> +};

I don't see why we'd need these extensions.    The ioctl just needs to
establish an ID to open file mapping that can be referenced on the
regular protocol, i.e. it just needs to be passed an open file
descriptor and return an unique ID.

Mapping the fuse file's data to the underlying file's data is a
different matter.  That can be an identity mapping established at open
time (this is what this series does) or it can be an arbitrary extent
mapping to one or more underlying open files, established at open time
or on demand.  All of these can be done in band using the fuse
protocol, no need to involve the ioctl mechanism.

So I think we can just get rid of "struct fuse_passthrough_out"
completely and use "uint32_t *" as the ioctl argument.

What I think would be useful is to have an explicit
FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
once the fuse server no longer needs this ID.   If this turns out to
be a performance problem, we could still add the auto-close behavior
with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release
  2021-01-25 15:30 ` [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release Alessio Balsini
@ 2021-02-17 13:52   ` Miklos Szeredi
  2021-05-05 12:21     ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2021-02-17 13:52 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
>
> Implement the FUSE passthrough ioctl that associates the lower
> (passthrough) file system file with the fuse_file.
>
> The file descriptor passed to the ioctl by the FUSE daemon is used to
> access the relative file pointer, that will be copied to the fuse_file
> data structure to consolidate the link between the FUSE and lower file
> system.
>
> To enable the passthrough mode, user space triggers the
> FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
> back an identifier that will be used at open/create response time in the
> fuse_open_out field to associate the FUSE file to the lower file system
> file.
> The value returned by the ioctl to user space can be:
> - > 0: success, the identifier can be used as part of an open/create
> reply.
> - <= 0: an error occurred.
> The value 0 represents an error to preserve backward compatibility: the
> fuse_open_out field that is used to pass the passthrough_fh back to the
> kernel uses the same bits that were previously as struct padding, and is
> commonly zero-initialized (e.g., in the libfuse implementation).
> Removing 0 from the correct values fixes the ambiguity between the case
> in which 0 corresponds to a real passthrough_fh, a missing
> implementation of FUSE passthrough or a request for a normal FUSE file,
> simplifying the user space implementation.
>
> For the passthrough mode to be successfully activated, the lower file
> system file must implement both read_iter and write_iter file
> operations. This extra check avoids special pseudo files to be targeted
> for this feature.
> Passthrough comes with another limitation: no further file system
> stacking is allowed for those FUSE file systems using passthrough.
>
> Signed-off-by: Alessio Balsini <balsini@android.com>
> ---
>  fs/fuse/inode.c       |  5 +++
>  fs/fuse/passthrough.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 90 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index a1104d5abb70..7ebc398fbacb 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1133,6 +1133,11 @@ EXPORT_SYMBOL_GPL(fuse_send_init);
>
>  static int free_fuse_passthrough(int id, void *p, void *data)
>  {
> +       struct fuse_passthrough *passthrough = (struct fuse_passthrough *)p;
> +
> +       fuse_passthrough_release(passthrough);
> +       kfree(p);
> +
>         return 0;
>  }
>
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index 594060c654f8..cf993e83803e 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -3,19 +3,102 @@
>  #include "fuse_i.h"
>
>  #include <linux/fuse.h>
> +#include <linux/idr.h>
>
>  int fuse_passthrough_open(struct fuse_dev *fud,
>                           struct fuse_passthrough_out *pto)
>  {
> -       return -EINVAL;
> +       int res;
> +       struct file *passthrough_filp;
> +       struct fuse_conn *fc = fud->fc;
> +       struct inode *passthrough_inode;
> +       struct super_block *passthrough_sb;
> +       struct fuse_passthrough *passthrough;
> +
> +       if (!fc->passthrough)
> +               return -EPERM;
> +
> +       /* This field is reserved for future implementation */
> +       if (pto->len != 0)
> +               return -EINVAL;
> +
> +       passthrough_filp = fget(pto->fd);
> +       if (!passthrough_filp) {
> +               pr_err("FUSE: invalid file descriptor for passthrough.\n");
> +               return -EBADF;
> +       }
> +
> +       if (!passthrough_filp->f_op->read_iter ||
> +           !passthrough_filp->f_op->write_iter) {
> +               pr_err("FUSE: passthrough file misses file operations.\n");
> +               res = -EBADF;
> +               goto err_free_file;
> +       }
> +
> +       passthrough_inode = file_inode(passthrough_filp);
> +       passthrough_sb = passthrough_inode->i_sb;
> +       if (passthrough_sb->s_stack_depth >= FILESYSTEM_MAX_STACK_DEPTH) {
> +               pr_err("FUSE: fs stacking depth exceeded for passthrough\n");

No need to print an error to the logs, this can be a perfectly normal
occurrence.  However I'd try to find a more unique error value than
EINVAL so that the fuse server can interpret this as "not your fault,
but can't support passthrough on this file".  E.g. EOPNOTSUPP.


> +               res = -EINVAL;
> +               goto err_free_file;
> +       }
> +
> +       passthrough = kmalloc(sizeof(struct fuse_passthrough), GFP_KERNEL);
> +       if (!passthrough) {
> +               res = -ENOMEM;
> +               goto err_free_file;
> +       }
> +
> +       passthrough->filp = passthrough_filp;
> +
> +       idr_preload(GFP_KERNEL);
> +       spin_lock(&fc->passthrough_req_lock);

Should be okay to use fc->lock, since neither adding nor removing the
passthrough ID should be a heavily used operation, and querying the
mapping is lockless.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 5/8] fuse: Introduce synchronous read and write for passthrough
  2021-01-25 15:30 ` [PATCH RESEND V12 5/8] fuse: Introduce synchronous read and write for passthrough Alessio Balsini
@ 2021-02-17 14:00   ` Miklos Szeredi
  0 siblings, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2021-02-17 14:00 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
>
> All the read and write operations performed on fuse_files which have the
> passthrough feature enabled are forwarded to the associated lower file
> system file via VFS.
>
> Sending the request directly to the lower file system avoids the
> userspace round-trip that, because of possible context switches and
> additional operations might reduce the overall performance, especially
> in those cases where caching doesn't help, for example in reads at
> random offsets.
>
> Verifying if a fuse_file has a lower file system file associated with
> can be done by checking the validity of its passthrough_filp pointer.
> This pointer is not NULL only if passthrough has been successfully
> enabled via the appropriate ioctl().
> When a read/write operation is requested for a FUSE file with
> passthrough enabled, a new equivalent VFS request is generated, which
> instead targets the lower file system file.
> The VFS layer performs additional checks that allow for safer operations
> but may cause the operation to fail if the process accessing the FUSE
> file system does not have access to the lower file system.
>
> This change only implements synchronous requests in passthrough,
> returning an error in the case of asynchronous operations, yet covering
> the majority of the use cases.
>
> Signed-off-by: Alessio Balsini <balsini@android.com>
> ---
>  fs/fuse/file.c        |  8 ++++--
>  fs/fuse/fuse_i.h      |  2 ++
>  fs/fuse/passthrough.c | 57 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 65 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 953f3034c375..cddada1e8bd9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1581,7 +1581,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>         if (FUSE_IS_DAX(inode))
>                 return fuse_dax_read_iter(iocb, to);
>
> -       if (!(ff->open_flags & FOPEN_DIRECT_IO))
> +       if (ff->passthrough.filp)
> +               return fuse_passthrough_read_iter(iocb, to);
> +       else if (!(ff->open_flags & FOPEN_DIRECT_IO))
>                 return fuse_cache_read_iter(iocb, to);
>         else
>                 return fuse_direct_read_iter(iocb, to);
> @@ -1599,7 +1601,9 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>         if (FUSE_IS_DAX(inode))
>                 return fuse_dax_write_iter(iocb, from);
>
> -       if (!(ff->open_flags & FOPEN_DIRECT_IO))
> +       if (ff->passthrough.filp)
> +               return fuse_passthrough_write_iter(iocb, from);
> +       else if (!(ff->open_flags & FOPEN_DIRECT_IO))
>                 return fuse_cache_write_iter(iocb, from);
>         else
>                 return fuse_direct_write_iter(iocb, from);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 8d39f5304a11..c4730d893324 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1239,5 +1239,7 @@ int fuse_passthrough_open(struct fuse_dev *fud,
>  int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
>                            struct fuse_open_out *openarg);
>  void fuse_passthrough_release(struct fuse_passthrough *passthrough);
> +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
> +ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
>
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index cf993e83803e..d949ca07a83b 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -4,6 +4,63 @@
>
>  #include <linux/fuse.h>
>  #include <linux/idr.h>
> +#include <linux/uio.h>
> +
> +#define PASSTHROUGH_IOCB_MASK                                                  \
> +       (IOCB_APPEND | IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
> +
> +static void fuse_copyattr(struct file *dst_file, struct file *src_file)
> +{
> +       struct inode *dst = file_inode(dst_file);
> +       struct inode *src = file_inode(src_file);
> +
> +       i_size_write(dst, i_size_read(src));
> +}

Hmm, I see why this is done, yet it's contrary to what's been set out
at the beginning: "All the requests other than reads or writes are
still handled by the userspace FUSE daemon."

Maybe just use fuse_write_update_size() instead of copying the size
from the underlying inode.

> +
> +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> +                                  struct iov_iter *iter)
> +{
> +       ssize_t ret;
> +       struct file *fuse_filp = iocb_fuse->ki_filp;
> +       struct fuse_file *ff = fuse_filp->private_data;
> +       struct file *passthrough_filp = ff->passthrough.filp;
> +
> +       if (!iov_iter_count(iter))
> +               return 0;
> +
> +       ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
> +                           iocb_to_rw_flags(iocb_fuse->ki_flags,
> +                                            PASSTHROUGH_IOCB_MASK));

Please split this line up into:

rwf = ioctb_to_rw_flags(...);
ret = vfs_iter_read(..., rwf);

> +
> +       return ret;
> +}
> +
> +ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> +                                   struct iov_iter *iter)
> +{
> +       ssize_t ret;
> +       struct file *fuse_filp = iocb_fuse->ki_filp;
> +       struct fuse_file *ff = fuse_filp->private_data;
> +       struct inode *fuse_inode = file_inode(fuse_filp);
> +       struct file *passthrough_filp = ff->passthrough.filp;
> +
> +       if (!iov_iter_count(iter))
> +               return 0;
> +
> +       inode_lock(fuse_inode);
> +
> +       file_start_write(passthrough_filp);
> +       ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
> +                            iocb_to_rw_flags(iocb_fuse->ki_flags,
> +                                             PASSTHROUGH_IOCB_MASK));

Same here.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 8/8] fuse: Introduce passthrough for mmap
  2021-01-25 15:30 ` [PATCH RESEND V12 8/8] fuse: Introduce passthrough for mmap Alessio Balsini
@ 2021-02-17 14:05   ` Miklos Szeredi
  2021-04-01 11:24     ` Alessio Balsini
  0 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2021-02-17 14:05 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
>
> Enabling FUSE passthrough for mmap-ed operations not only affects
> performance, but has also been shown as mandatory for the correct
> functioning of FUSE passthrough.
> yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
> data inconsistencies if the same file is also accessed with mmap. What
> happens is that read/write operations are directly applied to the lower
> file system (and its cache), while mmap-ed operations are affecting the
> FUSE cache.
>
> Extend the FUSE passthrough implementation to also handle memory-mapped
> FUSE file, to both fix the cache inconsistencies and extend the
> passthrough performance benefits to mmap-ed operations.
>
> [1] https://lore.kernel.org/lkml/20210119110654.11817-1-wu-yan@tcl.com/
>
> Signed-off-by: Alessio Balsini <balsini@android.com>
> ---
>  fs/fuse/file.c        |  3 +++
>  fs/fuse/fuse_i.h      |  1 +
>  fs/fuse/passthrough.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 45 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index cddada1e8bd9..e3741a94c1f9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2370,6 +2370,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
>         if (FUSE_IS_DAX(file_inode(file)))
>                 return fuse_dax_mmap(file, vma);
>
> +       if (ff->passthrough.filp)
> +               return fuse_passthrough_mmap(file, vma);
> +
>         if (ff->open_flags & FOPEN_DIRECT_IO) {
>                 /* Can't provide the coherency needed for MAP_SHARED */
>                 if (vma->vm_flags & VM_MAYSHARE)
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 815af1845b16..7b0d65984608 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1244,5 +1244,6 @@ int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
>  void fuse_passthrough_release(struct fuse_passthrough *passthrough);
>  ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
> +ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma);
>
>  #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index 24866c5fe7e2..284979f87747 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -135,6 +135,47 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
>         return ret;
>  }
>
> +ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +       int ret;
> +       const struct cred *old_cred;
> +       struct fuse_file *ff = file->private_data;
> +       struct inode *fuse_inode = file_inode(file);
> +       struct file *passthrough_filp = ff->passthrough.filp;
> +       struct inode *passthrough_inode = file_inode(passthrough_filp);
> +
> +       if (!passthrough_filp->f_op->mmap)
> +               return -ENODEV;
> +
> +       if (WARN_ON(file != vma->vm_file))
> +               return -EIO;
> +
> +       vma->vm_file = get_file(passthrough_filp);
> +
> +       old_cred = override_creds(ff->passthrough.cred);
> +       ret = call_mmap(vma->vm_file, vma);
> +       revert_creds(old_cred);
> +
> +       if (ret)
> +               fput(passthrough_filp);
> +       else
> +               fput(file);
> +
> +       if (file->f_flags & O_NOATIME)
> +               return ret;
> +
> +       if ((!timespec64_equal(&fuse_inode->i_mtime,
> +                              &passthrough_inode->i_mtime) ||
> +            !timespec64_equal(&fuse_inode->i_ctime,
> +                              &passthrough_inode->i_ctime))) {
> +               fuse_inode->i_mtime = passthrough_inode->i_mtime;
> +               fuse_inode->i_ctime = passthrough_inode->i_ctime;

Again, violation of rules.   Not sure why this is needed, mmap(2)
isn't supposed to change mtime or ctime, AFAIK.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2021-02-17 13:41   ` Miklos Szeredi
@ 2021-02-19  7:05     ` Peng Tao
  2021-02-19  8:40       ` Miklos Szeredi
  0 siblings, 1 reply; 66+ messages in thread
From: Peng Tao @ 2021-02-19  7:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Alessio Balsini, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel, Linux Kernel Mailing List

On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
> >
> > Expose the FUSE_PASSTHROUGH interface to user space and declare all the
> > basic data structures and functions as the skeleton on top of which the
> > FUSE passthrough functionality will be built.
> >
> > As part of this, introduce the new FUSE passthrough ioctl, which allows
> > the FUSE daemon to specify a direct connection between a FUSE file and a
> > lower file system file. Such ioctl requires user space to pass the file
> > descriptor of one of its opened files through the fuse_passthrough_out
> > data structure introduced in this patch. This structure includes extra
> > fields for possible future extensions.
> > Also, add the passthrough functions for the set-up and tear-down of the
> > data structures and locks that will be used both when fuse_conns and
> > fuse_files are created/deleted.
> >
> > Signed-off-by: Alessio Balsini <balsini@android.com>
>
> [...]
>
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 54442612c48b..9d7685ce0acd 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -360,6 +360,7 @@ struct fuse_file_lock {
> >  #define FUSE_MAP_ALIGNMENT     (1 << 26)
> >  #define FUSE_SUBMOUNTS         (1 << 27)
> >  #define FUSE_HANDLE_KILLPRIV_V2        (1 << 28)
> > +#define FUSE_PASSTHROUGH       (1 << 29)
>
> This header has a version and a changelog.  Please update those as well.
>
> >
> >  /**
> >   * CUSE INIT request/reply flags
> > @@ -625,7 +626,7 @@ struct fuse_create_in {
> >  struct fuse_open_out {
> >         uint64_t        fh;
> >         uint32_t        open_flags;
> > -       uint32_t        padding;
> > +       uint32_t        passthrough_fh;
>
> I think it would be cleaner to add a FOPEN_PASSTHROUGH flag to
> explicitly request passthrough instead of just passing a non-null
> value to passthrough_fh.
>
> >  };
> >
> >  struct fuse_release_in {
> > @@ -828,6 +829,13 @@ struct fuse_in_header {
> >         uint32_t        padding;
> >  };
> >
> > +struct fuse_passthrough_out {
> > +       uint32_t        fd;
> > +       /* For future implementation */
> > +       uint32_t        len;
> > +       void            *vec;
> > +};
>
> I don't see why we'd need these extensions.    The ioctl just needs to
> establish an ID to open file mapping that can be referenced on the
> regular protocol, i.e. it just needs to be passed an open file
> descriptor and return an unique ID.
>
> Mapping the fuse file's data to the underlying file's data is a
> different matter.  That can be an identity mapping established at open
> time (this is what this series does) or it can be an arbitrary extent
> mapping to one or more underlying open files, established at open time
> or on demand.  All of these can be done in band using the fuse
> protocol, no need to involve the ioctl mechanism.
>
> So I think we can just get rid of "struct fuse_passthrough_out"
> completely and use "uint32_t *" as the ioctl argument.
>
> What I think would be useful is to have an explicit
> FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> once the fuse server no longer needs this ID.   If this turns out to
> be a performance problem, we could still add the auto-close behavior
> with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
Hi Miklos,

W/o auto closing, what happens if user space daemon forgets to call
FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?

Thanks,
Tao
-- 
Into Sth. Rich & Strange

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2021-02-19  7:05     ` Peng Tao
@ 2021-02-19  8:40       ` Miklos Szeredi
  2021-03-01 17:05         ` Alessio Balsini
  0 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2021-02-19  8:40 UTC (permalink / raw)
  To: Peng Tao
  Cc: Alessio Balsini, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel, Linux Kernel Mailing List

On Fri, Feb 19, 2021 at 8:05 AM Peng Tao <bergwolf@gmail.com> wrote:
>
> On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:

> > What I think would be useful is to have an explicit
> > FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> > once the fuse server no longer needs this ID.   If this turns out to
> > be a performance problem, we could still add the auto-close behavior
> > with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
> Hi Miklos,
>
> W/o auto closing, what happens if user space daemon forgets to call
> FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?

Kernel would keep the ID open until explicit close or fuse connection
is released.

There should be some limit on the max open files referenced through
ID's, though.   E.g. inherit RLIMIT_NOFILE from mounting task.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-02-17 10:21   ` Miklos Szeredi
@ 2021-03-01 12:26     ` Alessio Balsini
  0 siblings, 0 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-03-01 12:26 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Alessio Balsini, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, kernel-team, linux-fsdevel, linux-kernel

On Wed, Feb 17, 2021 at 11:21:04AM +0100, Miklos Szeredi wrote:
> On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
> >
> > With a 64-bit kernel build the FUSE device cannot handle ioctl requests
> > coming from 32-bit user space.
> > This is due to the ioctl command translation that generates different
> > command identifiers that thus cannot be used for direct comparisons
> > without proper manipulation.
> >
> > Explicitly extract type and number from the ioctl command to enable
> > 32-bit user space compatibility on 64-bit kernel builds.
> >
> > Signed-off-by: Alessio Balsini <balsini@android.com>
> > ---
> >  fs/fuse/dev.c             | 29 ++++++++++++++++++-----------
> >  include/uapi/linux/fuse.h |  3 ++-
> >  2 files changed, 20 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 588f8d1240aa..ff9f3b83f879 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -2233,37 +2233,44 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
> >  static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> >                            unsigned long arg)
> >  {
> > -       int err = -ENOTTY;
> > +       int res;
> > +       int oldfd;
> > +       struct fuse_dev *fud = NULL;
> >
> > -       if (cmd == FUSE_DEV_IOC_CLONE) {
> > -               int oldfd;
> > +       if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
> > +               return -EINVAL;
> 
> Why change ENOTTY to EINVAL?
> 
> Thanks,
> MIklos

For the magic number mismatch I was thinking that EINVAL would have been
more appropriate, meaning: "this ioctl is definitely something this file
doesn't support".  ENOTTY, could be more specific to the subset of
ioctls supported by the file, so I kept this in the default case of the
switch.
After counting the use of ENOTTY vs EINVAL for these _IOC_TYPE checks,
EINVALs were less than half ENOTTYs, so I'm totally fine with switching
back to ENOTTY in V13.

Thanks!
Alessio

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2021-02-19  8:40       ` Miklos Szeredi
@ 2021-03-01 17:05         ` Alessio Balsini
  2022-09-08 15:36           ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-03-01 17:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Peng Tao, Alessio Balsini, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel, Linux Kernel Mailing List

On Fri, Feb 19, 2021 at 09:40:21AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 19, 2021 at 8:05 AM Peng Tao <bergwolf@gmail.com> wrote:
> >
> > On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > > What I think would be useful is to have an explicit
> > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> > > once the fuse server no longer needs this ID.   If this turns out to
> > > be a performance problem, we could still add the auto-close behavior
> > > with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
> > Hi Miklos,
> >
> > W/o auto closing, what happens if user space daemon forgets to call
> > FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?
> 
> Kernel would keep the ID open until explicit close or fuse connection
> is released.
> 
> There should be some limit on the max open files referenced through
> ID's, though.   E.g. inherit RLIMIT_NOFILE from mounting task.
> 
> Thanks,
> Miklos

I like the idea of FUSE_DEV_IOC_PASSTHROUGH_CLOSE to revoke the
passthrough access, that is something I was already working on. What I
had in mind was simply to break that 1:1 connection between fuse_file
and lower filp setting a specific fuse_file::passthrough::filp to NULL,
but this is slightly different from what you mentioned.

AFAIU you are suggesting to allocate one ID for each lower fs file
opened with passthrough within a connection, and maybe using idr_find at
every read/write/mmap operation to check if passthrough is enabled on
that file. Something similar to fuse2_map_get().
This way the fuse server can pass the same ID to one or more
fuse_file(s).
FUSE_DEV_IOC_PASSTHROUGH_CLOSE would idr_remove the ID, so idr_find
would fail, preventing the use of passthrough on that ID. CMIIW.

After FUSE_DEV_IOC_PASSTHROUGH_CLOSE(ID) it may happen that if some
fuse_file(s) storing that ID are still open and the same ID is reclaimed
in a new idr_alloc, this would lead to mismatching lower fs filp being
used by our fuse_file(s).  So also the ID stored in the fuse_file(s)
must be invalidated to prevent future uses of deallocated IDs.

Would it make sense to have a list of fuse_files using the same ID, that
must be traversed at FUSE_DEV_IOC_PASSTHROUGH_CLOSE time?
Negative values (maybe -ENOENT) might be used to mark IDs as invalid,
and tested before idr_find at read/write/mmap to avoid the idr_find
complexity in case passthrough is disabled for that file.

What do you think?


I agree with all the above comments to this patch, i.e., add
FOPEN_PASSTHROUGH, drop fuse_passthrough_out, header version+changelog,
that will be fixed in V13.


Thanks,
Alessio

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-01-25 15:30 ` [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device Alessio Balsini
       [not found]   ` <CAMAHBGzkfEd9-1u0iKXp65ReJQgUi_=4sMpmfkwEOaMp6Ux7pg@mail.gmail.com>
  2021-02-17 10:21   ` Miklos Szeredi
@ 2021-03-16 18:53   ` Arnd Bergmann
  2021-03-18 16:13     ` Alessio Balsini
  2 siblings, 1 reply; 66+ messages in thread
From: Arnd Bergmann @ 2021-03-16 18:53 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, Android Kernel Team, Linux FS-devel Mailing List,
	linux-kernel

On Mon, Jan 25, 2021 at 4:48 PM Alessio Balsini <balsini@android.com> wrote:
>
> With a 64-bit kernel build the FUSE device cannot handle ioctl requests
> coming from 32-bit user space.
> This is due to the ioctl command translation that generates different
> command identifiers that thus cannot be used for direct comparisons
> without proper manipulation.
>
> Explicitly extract type and number from the ioctl command to enable
> 32-bit user space compatibility on 64-bit kernel builds.
>
> Signed-off-by: Alessio Balsini <balsini@android.com>

I saw this commit go into the mainline kernel, and I'm worried that this
doesn't do what the description says. Since the argument is a 'uint32_t',
it is the same on both 32-bit and 64-bit user space, and the patch won't
make any difference for compat mode, as long as that is using the normal
uapi headers.

If there is any user space that has a different definition of
FUSE_DEV_IOC_CLONE, that may now successfully call
this ioctl command, but the kernel will now also accept any other
command code that has the same type and number, but an
arbitrary direction or size argument.

I think this should be changed back to specifically allow the
command code(s) that are actually used and nothing else.

       Arnd

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-01-28 14:15         ` Alessio Balsini
  2021-02-05  9:54           ` Peng Tao
@ 2021-03-16 18:57           ` Arnd Bergmann
  1 sibling, 0 replies; 66+ messages in thread
From: Arnd Bergmann @ 2021-03-16 18:57 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: qxy, Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, Android Kernel Team, Linux FS-devel Mailing List,
	linux-kernel

On Thu, Jan 28, 2021 at 3:17 PM Alessio Balsini <balsini@android.com> wrote:
>
> Hi all,
>
> I'm more than happy to change the interface into something that is
> objectively better and accepted by everyone.
> I would really love to reach the point at which we have a "stable-ish"
> UAPI as soon as possible.

It's in the mainline kernel, so you already have a stable uapi and
cannot change that in any incompatible way!

> I've been thinking about a few possible approaches to fix the issue, yet
> to preserve its flexibility. These are mentioned below.
>
>
>   Solution 1: Size
>
> As mentioned in my previous email, one solution could be to introduce
> the "size" field to allow the structure to grow in the future.
>
> struct fuse_passthrough_out {
>     uint32_t        size;   // Size of this data structure
>     uint32_t        fd;
> };
>
> The problem here is that we are making the promise that all the upcoming
> fields are going to be maintained forever and at the offsets they were
> originally defined.
>
>
>   Solution 2: Version
>
> Another solution could be to s/size/version, where for every version of
> FUSE passthrough we reserve the right to modifying the fields over time,
> casting them to the right data structure according to the version.


Please read Documentation/driver-api/ioctl.rst for how to design
ioctls. Neither 'size' nor 'version' fields are appropriate here. If you
have a new behavior, you need a new command code.

      Arnd

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-03-16 18:53   ` Arnd Bergmann
@ 2021-03-18 16:13     ` Alessio Balsini
  2021-03-18 21:15       ` Arnd Bergmann
  0 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-03-18 16:13 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alessio Balsini, Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, Android Kernel Team, Linux FS-devel Mailing List,
	linux-kernel

On Tue, Mar 16, 2021 at 07:53:06PM +0100, Arnd Bergmann wrote:
> On Mon, Jan 25, 2021 at 4:48 PM Alessio Balsini <balsini@android.com> wrote:
> >
> > With a 64-bit kernel build the FUSE device cannot handle ioctl requests
> > coming from 32-bit user space.
> > This is due to the ioctl command translation that generates different
> > command identifiers that thus cannot be used for direct comparisons
> > without proper manipulation.
> >
> > Explicitly extract type and number from the ioctl command to enable
> > 32-bit user space compatibility on 64-bit kernel builds.
> >
> > Signed-off-by: Alessio Balsini <balsini@android.com>
> 
> I saw this commit go into the mainline kernel, and I'm worried that this
> doesn't do what the description says. Since the argument is a 'uint32_t',
> it is the same on both 32-bit and 64-bit user space, and the patch won't
> make any difference for compat mode, as long as that is using the normal
> uapi headers.
> 
> If there is any user space that has a different definition of
> FUSE_DEV_IOC_CLONE, that may now successfully call
> this ioctl command, but the kernel will now also accept any other
> command code that has the same type and number, but an
> arbitrary direction or size argument.
> 
> I think this should be changed back to specifically allow the
> command code(s) that are actually used and nothing else.
> 
>        Arnd

Hi Arnd,

Thanks for spotting this possible criticality.

I noticed that 32-bit users pace was unable to use the
FUSE_DEV_IOC_CLONE ioctl on 64-bit kernels, so this change avoid this
issue by forcing the kernel to interpret 32 and 64 bit
FUSE_DEV_IOC_CLONE command as if they were the same.
This is the simplest solution I could find as the UAPI is not changed
as, as you mentioned, the argument doesn't require any conversion.

I understand that this might limit possible future extensions of the
FUSE_DEV_IOC_XXX ioctls if their in/out argument changed depending on
the architecture, but only at that point we can switch to using the
compat layer, right?

What I'm worried about is the direction, do you think this would be an
issue?

I can start working on a compat layer fix meanwhile.

Thanks,
Alessio

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-03-18 16:13     ` Alessio Balsini
@ 2021-03-18 21:15       ` Arnd Bergmann
  2021-03-19 15:21         ` Alessio Balsini
  0 siblings, 1 reply; 66+ messages in thread
From: Arnd Bergmann @ 2021-03-18 21:15 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, Android Kernel Team, Linux FS-devel Mailing List,
	linux-kernel

On Thu, Mar 18, 2021 at 5:13 PM Alessio Balsini <balsini@android.com> wrote:
> On Tue, Mar 16, 2021 at 07:53:06PM +0100, Arnd Bergmann wrote:
> > On Mon, Jan 25, 2021 at 4:48 PM Alessio Balsini <balsini@android.com> wrote:
> > >
>
> Thanks for spotting this possible criticality.
>
> I noticed that 32-bit users pace was unable to use the
> FUSE_DEV_IOC_CLONE ioctl on 64-bit kernels, so this change avoid this
> issue by forcing the kernel to interpret 32 and 64 bit
> FUSE_DEV_IOC_CLONE command as if they were the same.

As far as I can tell from the kernel headers, the command code should
be the same for both 32-bit and 64-bit tasks: 0x8004e500.
Can you find out what exact value you see in the user space that was
causing problems, and how it ended up with a different value than
the 64-bit version?

If there are two possible command codes, I'd suggest you just change
the driver to handle both variants explicitly, but not any other one.

> This is the simplest solution I could find as the UAPI is not changed
> as, as you mentioned, the argument doesn't require any conversion.
>
> I understand that this might limit possible future extensions of the
> FUSE_DEV_IOC_XXX ioctls if their in/out argument changed depending on
> the architecture, but only at that point we can switch to using the
> compat layer, right?
>
> What I'm worried about is the direction, do you think this would be an
> issue?
>
> I can start working on a compat layer fix meanwhile.

For a proper well-designed ioctl interface, compat support should not
need anything beyond the '.compat_ioctl = compat_ptr_ioctl'
assignment.

       Arnd

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device
  2021-03-18 21:15       ` Arnd Bergmann
@ 2021-03-19 15:21         ` Alessio Balsini
  0 siblings, 0 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-03-19 15:21 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alessio Balsini, Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, Android Kernel Team, Linux FS-devel Mailing List,
	linux-kernel

On Thu, Mar 18, 2021 at 10:15:43PM +0100, Arnd Bergmann wrote:
> On Thu, Mar 18, 2021 at 5:13 PM Alessio Balsini <balsini@android.com> wrote:
> > On Tue, Mar 16, 2021 at 07:53:06PM +0100, Arnd Bergmann wrote:
> > > On Mon, Jan 25, 2021 at 4:48 PM Alessio Balsini <balsini@android.com> wrote:
> > > >
> >
> > Thanks for spotting this possible criticality.
> >
> > I noticed that 32-bit users pace was unable to use the
> > FUSE_DEV_IOC_CLONE ioctl on 64-bit kernels, so this change avoid this
> > issue by forcing the kernel to interpret 32 and 64 bit
> > FUSE_DEV_IOC_CLONE command as if they were the same.
> 
> As far as I can tell from the kernel headers, the command code should
> be the same for both 32-bit and 64-bit tasks: 0x8004e500.
> Can you find out what exact value you see in the user space that was
> causing problems, and how it ended up with a different value than
> the 64-bit version?
> 
> If there are two possible command codes, I'd suggest you just change
> the driver to handle both variants explicitly, but not any other one.
> 
> > This is the simplest solution I could find as the UAPI is not changed
> > as, as you mentioned, the argument doesn't require any conversion.
> >
> > I understand that this might limit possible future extensions of the
> > FUSE_DEV_IOC_XXX ioctls if their in/out argument changed depending on
> > the architecture, but only at that point we can switch to using the
> > compat layer, right?
> >
> > What I'm worried about is the direction, do you think this would be an
> > issue?
> >
> > I can start working on a compat layer fix meanwhile.
> 
> For a proper well-designed ioctl interface, compat support should not
> need anything beyond the '.compat_ioctl = compat_ptr_ioctl'
> assignment.
> 
>        Arnd

You are right and now I can see what happened here.

When I introduce the PASSTHROUGH ioctl, because of the 'void *', the
command mismatches on _IOC_SIZE(nr). I solved this by only testing
_IOC_NUMBER and _IOC_TYPE, implicitely (mistakenly) removing the
_IOC_SIZE check.  So, the fuse_dev_ioctl was correctly rejecting the
ioctl request from 32-bit userspace because of the wrong size and I was
just forcing it to digest the wrong data regardless.
Ouch!

The commit message for this patch would still be incorrect, but I posted
a fix here to bring the ioctl checking back to the original behavior:

  https://lore.kernel.org/lkml/20210319150514.1315985-1-balsini@android.com/

Thanks for spotting this!
Alessio


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags
  2021-01-25 16:46   ` Alessio Balsini
@ 2021-03-24  7:43     ` Rokudo Yan
  2021-03-24 14:02       ` Alessio Balsini
  0 siblings, 1 reply; 66+ messages in thread
From: Rokudo Yan @ 2021-03-24  7:43 UTC (permalink / raw)
  To: Alessio Balsini, Miklos Szeredi
  Cc: Akilesh Kailash, Amir Goldstein, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

On 1/26/21 12:46 AM, Alessio Balsini wrote:
> On Mon, Jan 25, 2021 at 03:30:50PM +0000, Alessio Balsini wrote:
>> OverlayFS implements its own function to translate iocb flags into rw
>> flags, so that they can be passed into another vfs call.
>> With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
>> Jens created a 1:1 matching between the iocb flags and rw flags,
>> simplifying the conversion.
>>
>> Reduce the OverlayFS code by making the flag conversion function generic
>> and reusable.
>>
>> Signed-off-by: Alessio Balsini <balsini@android.com>
>> ---
>>   fs/overlayfs/file.c | 23 +++++------------------
>>   include/linux/fs.h  |  5 +++++
>>   2 files changed, 10 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> index bd9dd38347ae..56be2ffc5a14 100644
>> --- a/fs/overlayfs/file.c
>> +++ b/fs/overlayfs/file.c
>> @@ -15,6 +15,8 @@
>>   #include <linux/fs.h>
>>   #include "overlayfs.h"
>>   
>> +#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
>> +
>>   struct ovl_aio_req {
>>   	struct kiocb iocb;
>>   	struct kiocb *orig_iocb;
>> @@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
>>   	touch_atime(&file->f_path);
>>   }
>>   
>> -static rwf_t ovl_iocb_to_rwf(int ifl)
>> -{
>> -	rwf_t flags = 0;
>> -
>> -	if (ifl & IOCB_NOWAIT)
>> -		flags |= RWF_NOWAIT;
>> -	if (ifl & IOCB_HIPRI)
>> -		flags |= RWF_HIPRI;
>> -	if (ifl & IOCB_DSYNC)
>> -		flags |= RWF_DSYNC;
>> -	if (ifl & IOCB_SYNC)
>> -		flags |= RWF_SYNC;
>> -
>> -	return flags;
>> -}
>> -
>>   static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
>>   {
>>   	struct kiocb *iocb = &aio_req->iocb;
>> @@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>>   	old_cred = ovl_override_creds(file_inode(file)->i_sb);
>>   	if (is_sync_kiocb(iocb)) {
>>   		ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
>> -				    ovl_iocb_to_rwf(iocb->ki_flags));
>> +				    iocb_to_rw_flags(iocb->ki_flags,
>> +						     OVL_IOCB_MASK));
>>   	} else {
>>   		struct ovl_aio_req *aio_req;
>>   
>> @@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>>   	if (is_sync_kiocb(iocb)) {
>>   		file_start_write(real.file);
>>   		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
>> -				     ovl_iocb_to_rwf(ifl));
>> +				     iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
>>   		file_end_write(real.file);
>>   		/* Update size */
>>   		ovl_copyattr(ovl_inode_real(inode), inode);
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index fd47deea7c17..647c35423545 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>>   	return 0;
>>   }
>>   
>> +static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
>> +{
>> +	return ifl & iocb_mask;
>> +}
>> +
>>   static inline ino_t parent_ino(struct dentry *dentry)
>>   {
>>   	ino_t res;
>> -- 
>> 2.30.0.280.ga3ce27912f-goog
>>
> 
> For some reason lkml.org and lore.kernel.org are not showing this change
> as part of the thread.
> Let's see if replying to the email fixes the indexing.
> 
> Regards,
> Alessio
> 

Hi, Alessio

This change imply IOCB_* and RWF_* flags are properly aligned, which is 
not true for kernel version 5.4/4.19/4.14. As the patch ("fs: align 
IOCB_* flags with RWF_* flags") is not back-ported to these stable 
kernel branches. The issue was found when applying these patches
to kernel-5.4(files open with passthrough enabled can't do append 
write). I think the issue exists in AOSP common kernel too.
Could you please fix this?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce71bfea207b4d7c21d36f24ec37618ffcea1da8

https://android-review.googlesource.com/c/kernel/common/+/1556243

Thanks
yanwu

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags
  2021-03-24  7:43     ` Rokudo Yan
@ 2021-03-24 14:02       ` Alessio Balsini
  0 siblings, 0 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-03-24 14:02 UTC (permalink / raw)
  To: Rokudo Yan
  Cc: Alessio Balsini, Miklos Szeredi, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, fuse-devel,
	kernel-team, linux-fsdevel, linux-kernel

On Wed, Mar 24, 2021 at 03:43:12PM +0800, Rokudo Yan wrote:
> On 1/26/21 12:46 AM, Alessio Balsini wrote:
> > On Mon, Jan 25, 2021 at 03:30:50PM +0000, Alessio Balsini wrote:
> > > OverlayFS implements its own function to translate iocb flags into rw
> > > flags, so that they can be passed into another vfs call.
> > > With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
> > > Jens created a 1:1 matching between the iocb flags and rw flags,
> > > simplifying the conversion.
> > > 
> > > Reduce the OverlayFS code by making the flag conversion function generic
> > > and reusable.
> > > 
> > > Signed-off-by: Alessio Balsini <balsini@android.com>
> > > ---
> > >   fs/overlayfs/file.c | 23 +++++------------------
> > >   include/linux/fs.h  |  5 +++++
> > >   2 files changed, 10 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > index bd9dd38347ae..56be2ffc5a14 100644
> > > --- a/fs/overlayfs/file.c
> > > +++ b/fs/overlayfs/file.c
> > > @@ -15,6 +15,8 @@
> > >   #include <linux/fs.h>
> > >   #include "overlayfs.h"
> > > +#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
> > > +
> > >   struct ovl_aio_req {
> > >   	struct kiocb iocb;
> > >   	struct kiocb *orig_iocb;
> > > @@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
> > >   	touch_atime(&file->f_path);
> > >   }
> > > -static rwf_t ovl_iocb_to_rwf(int ifl)
> > > -{
> > > -	rwf_t flags = 0;
> > > -
> > > -	if (ifl & IOCB_NOWAIT)
> > > -		flags |= RWF_NOWAIT;
> > > -	if (ifl & IOCB_HIPRI)
> > > -		flags |= RWF_HIPRI;
> > > -	if (ifl & IOCB_DSYNC)
> > > -		flags |= RWF_DSYNC;
> > > -	if (ifl & IOCB_SYNC)
> > > -		flags |= RWF_SYNC;
> > > -
> > > -	return flags;
> > > -}
> > > -
> > >   static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
> > >   {
> > >   	struct kiocb *iocb = &aio_req->iocb;
> > > @@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >   	old_cred = ovl_override_creds(file_inode(file)->i_sb);
> > >   	if (is_sync_kiocb(iocb)) {
> > >   		ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
> > > -				    ovl_iocb_to_rwf(iocb->ki_flags));
> > > +				    iocb_to_rw_flags(iocb->ki_flags,
> > > +						     OVL_IOCB_MASK));
> > >   	} else {
> > >   		struct ovl_aio_req *aio_req;
> > > @@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >   	if (is_sync_kiocb(iocb)) {
> > >   		file_start_write(real.file);
> > >   		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> > > -				     ovl_iocb_to_rwf(ifl));
> > > +				     iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
> > >   		file_end_write(real.file);
> > >   		/* Update size */
> > >   		ovl_copyattr(ovl_inode_real(inode), inode);
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index fd47deea7c17..647c35423545 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > >   	return 0;
> > >   }
> > > +static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
> > > +{
> > > +	return ifl & iocb_mask;
> > > +}
> > > +
> > >   static inline ino_t parent_ino(struct dentry *dentry)
> > >   {
> > >   	ino_t res;
> > > -- 
> > > 2.30.0.280.ga3ce27912f-goog
> > > 
> > 
> > For some reason lkml.org and lore.kernel.org are not showing this change
> > as part of the thread.
> > Let's see if replying to the email fixes the indexing.
> > 
> > Regards,
> > Alessio
> > 
> 
> Hi, Alessio
> 
> This change imply IOCB_* and RWF_* flags are properly aligned, which is not
> true for kernel version 5.4/4.19/4.14. As the patch ("fs: align IOCB_* flags
> with RWF_* flags") is not back-ported to these stable kernel branches. The
> issue was found when applying these patches
> to kernel-5.4(files open with passthrough enabled can't do append write). I
> think the issue exists in AOSP common kernel too.
> Could you please fix this?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce71bfea207b4d7c21d36f24ec37618ffcea1da8
> 
> https://android-review.googlesource.com/c/kernel/common/+/1556243
> 
> Thanks
> yanwu

Hi yanwu,

Correct, this change depends on commit ce71bfea207b ("fs: align IOCB_*
flags with RWF_* flags"), and this dependency is satisfied upstream.
Being FUSE passthrough a new feature and not a bugfix, I'm not planning
to do any backporting to LTS kernels (and GregKH won't probably accept
it).

Android is a different story (and slightly out of topic here).
We are looking forward to have FUSE passthrough enabled on Android as
most of the user data is handled by FUSE. We liked the performance
improvements and non-intrusiveness of the change both for the kernel and
for userspace, so we started supporting this in android12-5.4+ kernel
branches. We are not planning to maintain the feature to older kernels
though (we can't add features to already released), and this is why FUSE
passthrough is not merged there.
To answer your question, in AOSP the officially supported kernels
already have the flags alignment change merged, and a not supported
backporting to older kernels (i.e., 4.14 and 4.19) is already available:

https://android-review.googlesource.com/q/%2522BACKPORT:+fs:+align+IOCB_*+flags+with+RWF_*+flags%2522+-status:abandoned

Thanks,
Alessio

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 8/8] fuse: Introduce passthrough for mmap
  2021-02-17 14:05   ` Miklos Szeredi
@ 2021-04-01 11:24     ` Alessio Balsini
  0 siblings, 0 replies; 66+ messages in thread
From: Alessio Balsini @ 2021-04-01 11:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Alessio Balsini, Akilesh Kailash, Amir Goldstein,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, kernel-team, linux-fsdevel, linux-kernel

On Wed, Feb 17, 2021 at 03:05:07PM +0100, Miklos Szeredi wrote:
> On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
> >
> > Enabling FUSE passthrough for mmap-ed operations not only affects
> > performance, but has also been shown as mandatory for the correct
> > functioning of FUSE passthrough.
> > yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
> > data inconsistencies if the same file is also accessed with mmap. What
> > happens is that read/write operations are directly applied to the lower
> > file system (and its cache), while mmap-ed operations are affecting the
> > FUSE cache.
> >
> > Extend the FUSE passthrough implementation to also handle memory-mapped
> > FUSE file, to both fix the cache inconsistencies and extend the
> > passthrough performance benefits to mmap-ed operations.
> >
> > [1] https://lore.kernel.org/lkml/20210119110654.11817-1-wu-yan@tcl.com/
> >
> > Signed-off-by: Alessio Balsini <balsini@android.com>
> > ---
> >  fs/fuse/file.c        |  3 +++
> >  fs/fuse/fuse_i.h      |  1 +
> >  fs/fuse/passthrough.c | 41 +++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 45 insertions(+)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index cddada1e8bd9..e3741a94c1f9 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -2370,6 +2370,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
> >         if (FUSE_IS_DAX(file_inode(file)))
> >                 return fuse_dax_mmap(file, vma);
> >
> > +       if (ff->passthrough.filp)
> > +               return fuse_passthrough_mmap(file, vma);
> > +
> >         if (ff->open_flags & FOPEN_DIRECT_IO) {
> >                 /* Can't provide the coherency needed for MAP_SHARED */
> >                 if (vma->vm_flags & VM_MAYSHARE)
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 815af1845b16..7b0d65984608 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1244,5 +1244,6 @@ int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
> >  void fuse_passthrough_release(struct fuse_passthrough *passthrough);
> >  ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
> >  ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
> > +ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma);
> >
> >  #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index 24866c5fe7e2..284979f87747 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -135,6 +135,47 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> >         return ret;
> >  }
> >
> > +ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +       int ret;
> > +       const struct cred *old_cred;
> > +       struct fuse_file *ff = file->private_data;
> > +       struct inode *fuse_inode = file_inode(file);
> > +       struct file *passthrough_filp = ff->passthrough.filp;
> > +       struct inode *passthrough_inode = file_inode(passthrough_filp);
> > +
> > +       if (!passthrough_filp->f_op->mmap)
> > +               return -ENODEV;
> > +
> > +       if (WARN_ON(file != vma->vm_file))
> > +               return -EIO;
> > +
> > +       vma->vm_file = get_file(passthrough_filp);
> > +
> > +       old_cred = override_creds(ff->passthrough.cred);
> > +       ret = call_mmap(vma->vm_file, vma);
> > +       revert_creds(old_cred);
> > +
> > +       if (ret)
> > +               fput(passthrough_filp);
> > +       else
> > +               fput(file);
> > +
> > +       if (file->f_flags & O_NOATIME)
> > +               return ret;
> > +
> > +       if ((!timespec64_equal(&fuse_inode->i_mtime,
> > +                              &passthrough_inode->i_mtime) ||
> > +            !timespec64_equal(&fuse_inode->i_ctime,
> > +                              &passthrough_inode->i_ctime))) {
> > +               fuse_inode->i_mtime = passthrough_inode->i_mtime;
> > +               fuse_inode->i_ctime = passthrough_inode->i_ctime;
> 
> Again, violation of rules.   Not sure why this is needed, mmap(2)
> isn't supposed to change mtime or ctime, AFAIK.
> 
> Thanks,
> Miklos

Hi Miklos,

I don't have a strong preference for this and will drop the ctime/atime
updates in v13.


For the records, here follows my reasoning for which I decided to update
atime/ctime here.

From the stats(2) man it just says that it's not guaranteed that atime
would be updated, as `Other routines, like mmap(2), may or may not
update st_atime.`

Something similar according to the inotify(7) man that warns not to trigger events
after mmap(2), msync(2), and munmap(2) operations.

The mmap(2) man mentions that st_ctime and st_mtime would be updated for
file mappings with PROT_WRITE and MAP_SHARED, before a msync(2) with
MS_SYNC or MS_ASYNC.
This passthrough scenario is slightly different from the standard mmap,
but it seems to me that we are kind of falling into a similar use case
for the atime/ctime update.
I would imagine this is why OverlayFS updates atime/ctime too in
ovl_mmap(), through ovl_copyattr().

Thanks,
Alessio

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release
  2021-02-17 13:52   ` Miklos Szeredi
@ 2021-05-05 12:21     ` Amir Goldstein
  2021-05-17 11:36       ` Alessio Balsini
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2021-05-05 12:21 UTC (permalink / raw)
  To: Miklos Szeredi, Alessio Balsini
  Cc: Akilesh Kailash, Antonio SJ Musumeci, David Anderson,
	Giuseppe Scrivano, Jann Horn, Jens Axboe, Martijn Coenen,
	Palmer Dabbelt, Paul Lawrence, Peng Tao, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel,
	linux-kernel

On Wed, Feb 17, 2021 at 3:52 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
> >
> > Implement the FUSE passthrough ioctl that associates the lower
> > (passthrough) file system file with the fuse_file.
> >
> > The file descriptor passed to the ioctl by the FUSE daemon is used to
> > access the relative file pointer, that will be copied to the fuse_file
> > data structure to consolidate the link between the FUSE and lower file
> > system.
> >
> > To enable the passthrough mode, user space triggers the
> > FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
> > back an identifier that will be used at open/create response time in the
> > fuse_open_out field to associate the FUSE file to the lower file system
> > file.
> > The value returned by the ioctl to user space can be:
> > - > 0: success, the identifier can be used as part of an open/create
> > reply.
> > - <= 0: an error occurred.
> > The value 0 represents an error to preserve backward compatibility: the
> > fuse_open_out field that is used to pass the passthrough_fh back to the
> > kernel uses the same bits that were previously as struct padding, and is
> > commonly zero-initialized (e.g., in the libfuse implementation).
> > Removing 0 from the correct values fixes the ambiguity between the case
> > in which 0 corresponds to a real passthrough_fh, a missing
> > implementation of FUSE passthrough or a request for a normal FUSE file,
> > simplifying the user space implementation.
> >
> > For the passthrough mode to be successfully activated, the lower file
> > system file must implement both read_iter and write_iter file
> > operations. This extra check avoids special pseudo files to be targeted
> > for this feature.
> > Passthrough comes with another limitation: no further file system
> > stacking is allowed for those FUSE file systems using passthrough.
> >
> > Signed-off-by: Alessio Balsini <balsini@android.com>
> > ---
> >  fs/fuse/inode.c       |  5 +++
> >  fs/fuse/passthrough.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 90 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index a1104d5abb70..7ebc398fbacb 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1133,6 +1133,11 @@ EXPORT_SYMBOL_GPL(fuse_send_init);
> >
> >  static int free_fuse_passthrough(int id, void *p, void *data)
> >  {
> > +       struct fuse_passthrough *passthrough = (struct fuse_passthrough *)p;
> > +
> > +       fuse_passthrough_release(passthrough);
> > +       kfree(p);
> > +
> >         return 0;
> >  }
> >
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index 594060c654f8..cf993e83803e 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -3,19 +3,102 @@
> >  #include "fuse_i.h"
> >
> >  #include <linux/fuse.h>
> > +#include <linux/idr.h>
> >
> >  int fuse_passthrough_open(struct fuse_dev *fud,
> >                           struct fuse_passthrough_out *pto)
> >  {
> > -       return -EINVAL;
> > +       int res;
> > +       struct file *passthrough_filp;
> > +       struct fuse_conn *fc = fud->fc;
> > +       struct inode *passthrough_inode;
> > +       struct super_block *passthrough_sb;
> > +       struct fuse_passthrough *passthrough;
> > +
> > +       if (!fc->passthrough)
> > +               return -EPERM;
> > +
> > +       /* This field is reserved for future implementation */
> > +       if (pto->len != 0)
> > +               return -EINVAL;
> > +
> > +       passthrough_filp = fget(pto->fd);
> > +       if (!passthrough_filp) {
> > +               pr_err("FUSE: invalid file descriptor for passthrough.\n");
> > +               return -EBADF;
> > +       }
> > +
> > +       if (!passthrough_filp->f_op->read_iter ||
> > +           !passthrough_filp->f_op->write_iter) {
> > +               pr_err("FUSE: passthrough file misses file operations.\n");
> > +               res = -EBADF;
> > +               goto err_free_file;
> > +       }
> > +
> > +       passthrough_inode = file_inode(passthrough_filp);
> > +       passthrough_sb = passthrough_inode->i_sb;
> > +       if (passthrough_sb->s_stack_depth >= FILESYSTEM_MAX_STACK_DEPTH) {
> > +               pr_err("FUSE: fs stacking depth exceeded for passthrough\n");
>
> No need to print an error to the logs, this can be a perfectly normal
> occurrence.  However I'd try to find a more unique error value than
> EINVAL so that the fuse server can interpret this as "not your fault,
> but can't support passthrough on this file".  E.g. EOPNOTSUPP.
>
>

Sorry for the fashionably late response...
Same comment for !{read,write}_iter case above.
EBAFD is really not appropriate there.
May I suggest ELOOP for s_stack_depth and EOPNOTSUPP
for no rw iter ops.

Are you planning to post another version of the patches soon?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release
  2021-05-05 12:21     ` Amir Goldstein
@ 2021-05-17 11:36       ` Alessio Balsini
  2021-05-17 13:21         ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Alessio Balsini @ 2021-05-17 11:36 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Alessio Balsini, Akilesh Kailash,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Peng Tao, Stefano Duo, Zimuzo Ezeozue, wuyan,
	fuse-devel, kernel-team, linux-fsdevel, linux-kernel

On Wed, May 05, 2021 at 03:21:26PM +0300, Amir Goldstein wrote:
> On Wed, Feb 17, 2021 at 3:52 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <balsini@android.com> wrote:
> > >
> > > Implement the FUSE passthrough ioctl that associates the lower
> > > (passthrough) file system file with the fuse_file.
> > >
> > > The file descriptor passed to the ioctl by the FUSE daemon is used to
> > > access the relative file pointer, that will be copied to the fuse_file
> > > data structure to consolidate the link between the FUSE and lower file
> > > system.
> > >
> > > To enable the passthrough mode, user space triggers the
> > > FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
> > > back an identifier that will be used at open/create response time in the
> > > fuse_open_out field to associate the FUSE file to the lower file system
> > > file.
> > > The value returned by the ioctl to user space can be:
> > > - > 0: success, the identifier can be used as part of an open/create
> > > reply.
> > > - <= 0: an error occurred.
> > > The value 0 represents an error to preserve backward compatibility: the
> > > fuse_open_out field that is used to pass the passthrough_fh back to the
> > > kernel uses the same bits that were previously as struct padding, and is
> > > commonly zero-initialized (e.g., in the libfuse implementation).
> > > Removing 0 from the correct values fixes the ambiguity between the case
> > > in which 0 corresponds to a real passthrough_fh, a missing
> > > implementation of FUSE passthrough or a request for a normal FUSE file,
> > > simplifying the user space implementation.
> > >
> > > For the passthrough mode to be successfully activated, the lower file
> > > system file must implement both read_iter and write_iter file
> > > operations. This extra check avoids special pseudo files to be targeted
> > > for this feature.
> > > Passthrough comes with another limitation: no further file system
> > > stacking is allowed for those FUSE file systems using passthrough.
> > >
> > > Signed-off-by: Alessio Balsini <balsini@android.com>
> > > ---
> > >  fs/fuse/inode.c       |  5 +++
> > >  fs/fuse/passthrough.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
> > >  2 files changed, 90 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index a1104d5abb70..7ebc398fbacb 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -1133,6 +1133,11 @@ EXPORT_SYMBOL_GPL(fuse_send_init);
> > >
> > >  static int free_fuse_passthrough(int id, void *p, void *data)
> > >  {
> > > +       struct fuse_passthrough *passthrough = (struct fuse_passthrough *)p;
> > > +
> > > +       fuse_passthrough_release(passthrough);
> > > +       kfree(p);
> > > +
> > >         return 0;
> > >  }
> > >
> > > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > > index 594060c654f8..cf993e83803e 100644
> > > --- a/fs/fuse/passthrough.c
> > > +++ b/fs/fuse/passthrough.c
> > > @@ -3,19 +3,102 @@
> > >  #include "fuse_i.h"
> > >
> > >  #include <linux/fuse.h>
> > > +#include <linux/idr.h>
> > >
> > >  int fuse_passthrough_open(struct fuse_dev *fud,
> > >                           struct fuse_passthrough_out *pto)
> > >  {
> > > -       return -EINVAL;
> > > +       int res;
> > > +       struct file *passthrough_filp;
> > > +       struct fuse_conn *fc = fud->fc;
> > > +       struct inode *passthrough_inode;
> > > +       struct super_block *passthrough_sb;
> > > +       struct fuse_passthrough *passthrough;
> > > +
> > > +       if (!fc->passthrough)
> > > +               return -EPERM;
> > > +
> > > +       /* This field is reserved for future implementation */
> > > +       if (pto->len != 0)
> > > +               return -EINVAL;
> > > +
> > > +       passthrough_filp = fget(pto->fd);
> > > +       if (!passthrough_filp) {
> > > +               pr_err("FUSE: invalid file descriptor for passthrough.\n");
> > > +               return -EBADF;
> > > +       }
> > > +
> > > +       if (!passthrough_filp->f_op->read_iter ||
> > > +           !passthrough_filp->f_op->write_iter) {
> > > +               pr_err("FUSE: passthrough file misses file operations.\n");
> > > +               res = -EBADF;
> > > +               goto err_free_file;
> > > +       }
> > > +
> > > +       passthrough_inode = file_inode(passthrough_filp);
> > > +       passthrough_sb = passthrough_inode->i_sb;
> > > +       if (passthrough_sb->s_stack_depth >= FILESYSTEM_MAX_STACK_DEPTH) {
> > > +               pr_err("FUSE: fs stacking depth exceeded for passthrough\n");
> >
> > No need to print an error to the logs, this can be a perfectly normal
> > occurrence.  However I'd try to find a more unique error value than
> > EINVAL so that the fuse server can interpret this as "not your fault,
> > but can't support passthrough on this file".  E.g. EOPNOTSUPP.
> >
> >
> 
> Sorry for the fashionably late response...
> Same comment for !{read,write}_iter case above.
> EBAFD is really not appropriate there.
> May I suggest ELOOP for s_stack_depth and EOPNOTSUPP
> for no rw iter ops.
> 
> Are you planning to post another version of the patches soon?
> 
> Thanks,
> Amir.

Hi Amir,

Thanks for your feedback. I like the ELOOP for the stack_depth, and the
EOPNOTSUPP is already in my local branch.

I've been busy with the integration of this last patchset in Android,
that unfortunately will be shipped as out-of-tree. I'm keeping all my
fingers crossed for my workarounds to prevent future passthrough UAPI
breakages to work. :)

I have an ugly patch which uses IDR as Miklos asked, but unfortunately
I'm facing some performance issues due to the locking mechanisms to keep
guarantee the RCU consistency. I can post the new patch set as an RFC
soon for the community to take a look.
At a glance what happens is:
- the IDR, one for each fuse_conn, contains pointers to "struct
  fuse_passthrough" containing:
  - fuse_file *: which is using passthrough,
  - file *: native file system file target,
  - cred of the FUSE server,
- ioctl(PASSTHROUGH_OPEN): updates IDR, requires spinlock:
  - kmalloc(fuse_passthrough), update file and cred,
  - ID = idr_alloc(),
  - return ID,
- fuse_open reply from FUSE server with passthrough ID: updates IDR,
  requires spinlock:
  - pt = idr_find(ID),
  - update fuse_file with the current fuse_file,
  - update fuse_file->passthrough_id = ID,
  - idr_replace(),
- read/write/mmap: lock-free IDR read:
  - idr_find(fuse_file::passthrough_id),
  - forward request to lower file system as for the current FUSE
    passthrough patches.
- ioctl(PASSTHROUGH_CLOSE): updates IDR, requires spinlock:
  - idr_remove();
  - call_rcu(): tell possible open fuse_file user that the ID is no more
    valid and kfree() the allocated struct;
- close(fuse_file): updates IDR, requires spinlock:
  - ID = fuse_file::passthrough_id
  - idr_find(ID),
  - fuse_passthrough::fuse_file = NULL,
  - idr_replace().

This would be fine if passthrough is activated for a few, big files,
where the passthrough overhead is dominated by the direct access
benefits, but if passthrough is enabled for many small files which just
do one or two read/writes (as what I did for my benchmark of total build
time for the kernel, where I was passing-through every opened file), the
overhead becomes a real issue.

If you have any thoughts on how to make this simpler, I'm absolutely
open to fix this.

We are also in the preliminary design step of having a hierarchical
passthrough feature, for which all the dir/file operations performed in
any of the contents of the passthrough folder, are automatically
forwarded to the lower file system.
That may fix this overhead issue, but it's still early to forecast what
is going to happen with this idea. Right now I'm not even sure it's
feasible.

Looking forward to receiving some feedback, thanks in advance!
Alessio

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release
  2021-05-17 11:36       ` Alessio Balsini
@ 2021-05-17 13:21         ` Amir Goldstein
  0 siblings, 0 replies; 66+ messages in thread
From: Amir Goldstein @ 2021-05-17 13:21 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Miklos Szeredi, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

> I have an ugly patch which uses IDR as Miklos asked, but unfortunately
> I'm facing some performance issues due to the locking mechanisms to keep
> guarantee the RCU consistency. I can post the new patch set as an RFC
> soon for the community to take a look.
> At a glance what happens is:
> - the IDR, one for each fuse_conn, contains pointers to "struct
>   fuse_passthrough" containing:
>   - fuse_file *: which is using passthrough,
>   - file *: native file system file target,
>   - cred of the FUSE server,
> - ioctl(PASSTHROUGH_OPEN): updates IDR, requires spinlock:
>   - kmalloc(fuse_passthrough), update file and cred,
>   - ID = idr_alloc(),
>   - return ID,
> - fuse_open reply from FUSE server with passthrough ID: updates IDR,
>   requires spinlock:
>   - pt = idr_find(ID),
>   - update fuse_file with the current fuse_file,
>   - update fuse_file->passthrough_id = ID,
>   - idr_replace(),
> - read/write/mmap: lock-free IDR read:
>   - idr_find(fuse_file::passthrough_id),
>   - forward request to lower file system as for the current FUSE
>     passthrough patches.
> - ioctl(PASSTHROUGH_CLOSE): updates IDR, requires spinlock:
>   - idr_remove();
>   - call_rcu(): tell possible open fuse_file user that the ID is no more
>     valid and kfree() the allocated struct;
> - close(fuse_file): updates IDR, requires spinlock:
>   - ID = fuse_file::passthrough_id
>   - idr_find(ID),
>   - fuse_passthrough::fuse_file = NULL,
>   - idr_replace().
>
> This would be fine if passthrough is activated for a few, big files,
> where the passthrough overhead is dominated by the direct access
> benefits, but if passthrough is enabled for many small files which just
> do one or two read/writes (as what I did for my benchmark of total build
> time for the kernel, where I was passing-through every opened file), the
> overhead becomes a real issue.
>
> If you have any thoughts on how to make this simpler, I'm absolutely
> open to fix this.
>

This IDR namespace usually serves a single process. Right?
It sounds a bit more like a file table, more specifically, like io_file_table.
I may be way off, but this sounds a bit like IORING_REGISTER_FILES.
Is there anything that can be learned from this UAPI?
Maybe even reuse of some of the io_uring file register code?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write
  2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
                   ` (7 preceding siblings ...)
  2021-01-25 15:30 ` [PATCH RESEND V12 8/8] fuse: Introduce passthrough for mmap Alessio Balsini
@ 2021-11-18 18:31 ` Amir Goldstein
  8 siblings, 0 replies; 66+ messages in thread
From: Amir Goldstein @ 2021-11-18 18:31 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Miklos Szeredi, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Peng Tao,
	Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team,
	linux-fsdevel, linux-kernel

On Mon, Jan 25, 2021 at 5:31 PM Alessio Balsini <balsini@android.com> wrote:
>
> This is the 12th version of the series, rebased on top of v5.11-rc5.
> Please find the changelog at the bottom of this cover letter.
>
> Add support for file system passthrough read/write of files when enabled
> in userspace through the option FUSE_PASSTHROUGH.
>
> There are file systems based on FUSE that are intended to enforce
> special policies or trigger complicated decision makings at the file
> operations level. Android, for example, uses FUSE to enforce
> fine-grained access policies that also depend on the file contents.
> Sometimes it happens that at open or create time a file is identified as
> not requiring additional checks for consequent reads/writes, thus FUSE
> would simply act as a passive bridge between the process accessing the
> FUSE file system and the lower file system. Splicing and caching help
> reduce the FUSE overhead, but there are still read/write operations
> forwarded to the userspace FUSE daemon that could be avoided.
>
> This series has been inspired by the original patches from Nikhilesh
> Reddy, the idea and code of which has been elaborated and improved
> thanks to the community support.
>
> When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may
> decide while handling the open/create operations, if the given file can
> be accessed in passthrough mode. This means that all the further read
> and write operations would be forwarded by the kernel directly to the
> lower file system using the VFS layer rather than to the FUSE daemon.
> All the requests other than reads or writes are still handled by the
> userspace FUSE daemon.
> This allows for improved performance on reads and writes, especially in
> the case of reads at random offsets, for which no (readahead) caching
> mechanism would help.
> Benchmarks show improved performance that is close to native file system
> access when doing massive manipulations on a single opened file,
> especially in the case of random reads, random writes and sequential
> writes. Detailed benchmarking results are presented below.
>
> The creation of this direct connection (passthrough) between FUSE file
> objects and file objects in the lower file system happens in a way that
> reminds of passing file descriptors via sockets:
> - a process requests the opening of a file handled by FUSE, so the
>   kernel forwards the request to the FUSE daemon;
> - the FUSE daemon opens the target file in the lower file system,
>   getting its file descriptor;
> - the FUSE daemon also decides according to its internal policies if
>   passthrough can be enabled for that file, and, if so, can perform a
>   FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl on /dev/fuse, passing the file
>   descriptor obtained at the previous step and the fuse_req unique
>   identifier;
> - the kernel translates the file descriptor to the file pointer
>   navigating through the opened files of the "current" process and
>   temporarily stores it in the associated open/create fuse_req's
>   passthrough_filp;
> - when the FUSE daemon has done with the request and it's time for the
>   kernel to close it, it checks if the passthrough_filp is available and
> in case updates the additional field in the fuse_file owned by the
> process accessing the FUSE file system.
> From now on, all the read/write operations performed by that process
> will be redirected to the corresponding lower file system file by
> creating new VFS requests.
> Since the read/write operation to the lower file system is executed with
> the current process's credentials, it might happen that it does not have
> enough privileges to succeed. For this reason, the process temporarily
> receives the same credentials as the FUSE daemon, that are reverted as
> soon as the read/write operation completes, emulating the behavior of
> the request to be performed by the FUSE daemon itself. This solution has
> been inspired by the way overlayfs handles read/write operations.
> Asynchronous IO is supported as well, handled by creating separate AIO
> requests for the lower file system that will be internally tracked by
> FUSE, that intercepts and propagates their completion through an
> internal ki_completed callback similar to the current implementation of
> overlayfs.
> Finally, also memory-mapped FUSE files are supported in this FUSE
> passthrough series as it has been noticed that when a same file with
> FUSE passthrough enabled is accessed both with standard
> read/write(-iter) operations and memory-mapped read/write operations,
> the file content might result corrupted due to an inconsistency between
> the FUSE and lower file system caches.
>
> The ioctl has been designed taking as a reference and trying to converge
> to the fuse2 implementation. For example, the fuse_passthrough_out data
> structure has extra fields that will allow for further extensions of the
> feature.
>
>
>     Performance on RAM block device
>
> What follows has been performed using a custom passthrough_hp FUSE
> daemon that enables pass-through for each file that is opened during
> both "open" and "create". Benchmarks were run on an Intel Xeon W-2135,
> 64 GiB of RAM workstation, with a RAM block device used as storage
> target. More specifically, out of the system's 64 GiB of RAM, 40 GiB
> were reserved for /dev/ram0, formatted as ext4. For the FUSE and FUSE
> passthrough benchmarks, the FUSE file system was mounted on top of the
> mounted /dev/ram0 device.
> That file system has been completely filled and then cleaned up before
> running the benchmarks: this to ensure that all the /dev/ram0 space was
> reserved and not usable as page cache.
>
> The rationale for using a RAM block device is that SSDs may experience
> performance fluctuations, especially when dealing with accessing data
> random offsets.
> Getting rid of the discrete storage device also removes a huge component
> of slowness, highlighting the performance difference of the software
> parts (and probably the goodness of CPU caching and its coherence
> mechanisms).
>
> No special tuning has been performed, e.g., all the involved processes
> are SCHED_OTHER, ondemand is the frequency governor with no frequency
> restrictions, and turbo-boost, as well as p-state, are active. This is
> because I noticed that, for such high-level benchmarks, results
> consistency was minimally affected by these features.
>
> The source code of the updated libfuse library and passthrough_hp is
> shared at the following repository:
>
>   https://github.com/balsini/libfuse/tree/fuse-passthrough-v12-v5.11-rc5
>
> Two different kinds of benchmarks were done for this change, the first
> set of tests evaluates the bandwidth improvements when manipulating huge
> single files, the second set of tests verify that no performance
> regressions were introduced when handling many small files.
>
> All the caches were dropped before running every benchmark with:
>
>   echo 3 > /proc/sys/vm/drop_caches
>
> All the benchmarks were run 10 times, with 1 minute cool down between
> each run.
>
> The first benchmarks were done by running FIO (fio-3.24) with:
> - bs=4Ki;
> - file size: 35Gi;
> - ioengine: sync;
> - fsync_on_close=1;
> - randseed=0.
> The target file has been chosen large enough to avoid it to be entirely
> loaded into the page cache.
>
> Results are presented in the following table:
>
> +-----------+------------+-------------+-------------+
> |   MiB/s   |    fuse    | passthrough |   native    |
> +-----------+------------+-------------+-------------+
> | read      | 471(±1.3%) | 1791(±1.0%) | 1839(±1.8%) |
> | write     | 95(±.6%)   | 1068(±.9%)  | 1322(±.8%)  |
> | randread  | 25(±1.7%)  | 860(±.8%)   | 1135(±.5%)  |
> | randwrite | 76(±3.0%)  | 813(±1.0%)  | 1005(±.7%)  |
> +-----------+------------+-------------+-------------+
>
> This table shows that FUSE, except for the sequential reads, is far
> behind FUSE passthrough and native in terms of performance. The
> extremely good FUSE performance for sequential reads is the result of a
> great read-ahead mechanism. I was able to verify that setting
> read_ahead_kb to 0 causes a terrible performance drop.
> All the results are stable, as shown by the standard deviations.
> Moreover, these numbers show the reasonable gap between passthrough and
> native, introduced by the extra traversal through the VFS layer.
>
> As long as this patch has the primary objective of improving bandwidth,
> another set of tests has been performed to see how this behaves on a
> totally different scenario that involves accessing many small files. For
> this purpose, measuring the build time of the Linux kernel has been
> chosen as an appropriate, well-known, workload. The kernel has been
> built with as many processes as the number of logical CPUs (-j
> $(nproc)), that besides being a reasonable parallelization value, is
> also enough to saturate the processor's utilization thanks to the
> additional FUSE daemon's threads, making it even harder to get closer to
> the native file system performance.
> The following table shows the total build times in the different
> configurations:
>
> +------------------+--------------+-----------+
> |                  | AVG duration |  Standard |
> |                  |     (sec)    | deviation |
> +------------------+--------------+-----------+
> | FUSE             |      144.566 |     0.697 |
> +------------------+--------------+-----------+
> | FUSE passthrough |      133.820 |     0.341 |
> +------------------+--------------+-----------+
> | Native           |      109.423 |     0.724 |
> +------------------+--------------+-----------+
>
> Further testing and performance evaluations are welcome.
>
>
>     Description of the series
>
> Patch 1 generalizes the function which converts iocb flags to rw flags
> from overlayfs, so that can be used in this patch set.
>
> Patch 2 enables the 32-bit compatibility for the /dev/fuse ioctl.
>
> Patch 3 introduces the data structures, function signatures and ioctl
> required both for the communication with userspace and for the internal
> kernel use.
>
> Patch 4 introduces initialization and release functions for FUSE
> passthrough.
>
> Patch 5 enables the synchronous read and write operations for those FUSE
> files for which the passthrough functionality is enabled.
>
> Patch 6 extends the read and write operations to also support
> asynchronous IO.
>
> Patch 7 allows FUSE passthrough to target files for which the requesting
> process would not have direct access to, by temporarily performing a
> credentials switch to the credentials of the FUSE daemon that issued the
> FUSE passthrough ioctl.
>
> Patch 8 extends FUSE passthrough operations to memory-mapped FUSE files.
>
>
>     Changelog
>
> Changes in v12:
> * Revert FILESYSTEM_MAX_STACK_DEPTH checks as they were in v10
>   [Requested by Amir Goldstein]
> * Introduce passthrough support for memory-mapped FUSE files
>   [Requested by yanwu]
>
> Changes in v11:
> * Fix the FILESYSTEM_MAX_STACK_DEPTH check to allow other file systems
>   to be stacked
> * Moved file system stacking depth check at ioctl time
> * Update cover letter with correct libfuse repository to test the change
>   [Requested by Peng Tao]
> * Fix the file reference counter leak introduced in v10
>   [Requested by yanwu]
>
> Changes in v10:
> * UAPI updated: ioctl now returns an ID that will be used at open/create
>   response time to reference the passthrough file
> * Synchronous read/write_iter functions does not return silly errors
>   (fixed in aio patch)
> * FUSE daemon credentials updated at ioctl time instead of mount time
> * Updated benchmark results
>   [Requested by Miklos Szeredi]
>
> Changes in v9:
> * Switched to using VFS instead of direct lower FS file ops
>   [Attempt to address a request from Jens Axboe, Jann Horn,
>   Amir Goldstein]
> * Removal of useless included aio.h header
>   [Proposed by Jens Axboe]
>
> Changes in v8:
> * aio requests now use kmalloc/kfree, instead of kmem_cache
> * Switched to call_{read,write}_iter in AIO
> * Revisited attributes copy
> * Passthrough can only be enabled via ioctl, fixing the security issue
>   spotted by Jann
> * Use an extensible fuse_passthrough_out data structure
>   [Attempt to address a request from Nikolaus Rath, Amir Goldstein and
> Miklos Szeredi]
>
> Changes in v7:
> * Full handling of aio requests as done in overlayfs (update commit
> * message).
> * s/fget_raw/fget.
> * Open fails in case of passthrough errors, emitting warning messages.
>   [Proposed by Jann Horn]
> * Create new local kiocb, getting rid of the previously proposed ki_filp
>   swapping.
>   [Proposed by Jann Horn and Jens Axboe]
> * Code polishing.
>
> Changes in v6:
> * Port to kernel v5.8:
>   * fuse_file_{read,write}_iter changed since the v5 of this patch was
>     proposed.
> * Simplify fuse_simple_request.
> * Merge fuse_passthrough.h into fuse_i.h
> * Refactor of passthrough.c:
>   * Remove BUG_ONs.
>   * Simplified error checking and request arguments indexing.
>   * Use call_{read,write}_iter utility functions.
>   * Remove get_file and fputs during read/write: handle the extra FUSE
>     references to the lower file object when the fuse_file is
>     created/deleted.
>   [Proposed by Jann Horn]
>
> Changes in v5:
> * Fix the check when setting the passthrough file.
>   [Found when testing by Mike Shal]
>
> Changes in v3 and v4:
> * Use the fs_stack_depth to prevent further stacking and a minor fix.
>   [Proposed by Jann Horn]
>
> Changes in v2:
> * Changed the feature name to passthrough from stacked_io.
>   [Proposed by Linus Torvalds]
>
>
> Alessio Balsini (8):
>   fs: Generic function to convert iocb to rw flags
>   fuse: 32-bit user space ioctl compat for fuse device
>   fuse: Definitions and ioctl for passthrough
>   fuse: Passthrough initialization and release
>   fuse: Introduce synchronous read and write for passthrough
>   fuse: Handle asynchronous read and write in passthrough
>   fuse: Use daemon creds in passthrough mode
>   fuse: Introduce passthrough for mmap
>
>  fs/fuse/Makefile          |   1 +
>  fs/fuse/dev.c             |  41 ++++--
>  fs/fuse/dir.c             |   2 +
>  fs/fuse/file.c            |  15 +-
>  fs/fuse/fuse_i.h          |  33 +++++
>  fs/fuse/inode.c           |  22 ++-
>  fs/fuse/passthrough.c     | 280 ++++++++++++++++++++++++++++++++++++++
>  fs/overlayfs/file.c       |  23 +---
>  include/linux/fs.h        |   5 +
>  include/uapi/linux/fuse.h |  14 +-
>  10 files changed, 401 insertions(+), 35 deletions(-)
>  create mode 100644 fs/fuse/passthrough.c
>
> --
> 2.30.0.280.ga3ce27912f-goog
>

Hi Alessio,

I have been testing this patch set for a while and recently
nfstest_posix found one issue:
mtime/ctime are not invalidated on passthrough write.

I have tested a fix on this 5.10.y backport branch:
https://github.com/amir73il/linux/commits/linux-5.10.y-fuse-passthrough

Please feel free to review and/or take the fix for your next posting
if you have plans of posting another version...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2021-03-01 17:05         ` Alessio Balsini
@ 2022-09-08 15:36           ` Amir Goldstein
  2022-09-09 19:07             ` Miklos Szeredi
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2022-09-08 15:36 UTC (permalink / raw)
  To: Alessio Balsini, Miklos Szeredi
  Cc: Peng Tao, Akilesh Kailash, Antonio SJ Musumeci, David Anderson,
	Giuseppe Scrivano, Jann Horn, Jens Axboe, Martijn Coenen,
	Palmer Dabbelt, Paul Lawrence, Stefano Duo, Zimuzo Ezeozue,
	wuyan, fuse-devel, kernel-team, linux-fsdevel

Hi Alessio and Miklos,

Some time has passed.. and I was thinking of picking up these patches.

On Mon, Mar 1, 2021 at 7:05 PM Alessio Balsini <balsini@android.com> wrote:
>
> On Fri, Feb 19, 2021 at 09:40:21AM +0100, Miklos Szeredi wrote:
> > On Fri, Feb 19, 2021 at 8:05 AM Peng Tao <bergwolf@gmail.com> wrote:
> > >
> > > On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > > What I think would be useful is to have an explicit
> > > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> > > > once the fuse server no longer needs this ID.   If this turns out to
> > > > be a performance problem, we could still add the auto-close behavior
> > > > with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
> > > Hi Miklos,
> > >
> > > W/o auto closing, what happens if user space daemon forgets to call
> > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?
> >
> > Kernel would keep the ID open until explicit close or fuse connection
> > is released.
> >
> > There should be some limit on the max open files referenced through
> > ID's, though.   E.g. inherit RLIMIT_NOFILE from mounting task.
> >
> > Thanks,
> > Miklos
>
> I like the idea of FUSE_DEV_IOC_PASSTHROUGH_CLOSE to revoke the
> passthrough access, that is something I was already working on. What I
> had in mind was simply to break that 1:1 connection between fuse_file
> and lower filp setting a specific fuse_file::passthrough::filp to NULL,
> but this is slightly different from what you mentioned.
>

I don't like the idea of switching between passthrough and server mid-life
of an open file.

There are consequences related to syncing the attribute cache of the kernel
and the server that I don't even want to think about.

> AFAIU you are suggesting to allocate one ID for each lower fs file
> opened with passthrough within a connection, and maybe using idr_find at
> every read/write/mmap operation to check if passthrough is enabled on
> that file. Something similar to fuse2_map_get().
> This way the fuse server can pass the same ID to one or more
> fuse_file(s).
> FUSE_DEV_IOC_PASSTHROUGH_CLOSE would idr_remove the ID, so idr_find
> would fail, preventing the use of passthrough on that ID. CMIIW.
>

I don't think that FUSE_DEV_IOC_PASSTHROUGH_CLOSE should remove the ID.
We can use a refcount for the mapping and FUSE_DEV_IOC_PASSTHROUGH_CLOSE
just drops the initial server's refcount.

Implementing revoke for an existing mapping is something completely different.
It can be done, not even so hard, but I don't think it should be part of this
series and in any case revoke will not remove the ID.

> After FUSE_DEV_IOC_PASSTHROUGH_CLOSE(ID) it may happen that if some
> fuse_file(s) storing that ID are still open and the same ID is reclaimed
> in a new idr_alloc, this would lead to mismatching lower fs filp being
> used by our fuse_file(s).  So also the ID stored in the fuse_file(s)
> must be invalidated to prevent future uses of deallocated IDs.

Obtaining a refcount on FOPEN_PASSTHROUGH will solve that.

>
> Would it make sense to have a list of fuse_files using the same ID, that
> must be traversed at FUSE_DEV_IOC_PASSTHROUGH_CLOSE time?
> Negative values (maybe -ENOENT) might be used to mark IDs as invalid,
> and tested before idr_find at read/write/mmap to avoid the idr_find
> complexity in case passthrough is disabled for that file.
>
> What do you think?
>

As I wrote above, this sounds unnecessarily complicated.

Miklos,

Do you agree with my interpretation of
FUSE_DEV_IOC_PASSTHROUGH_CLOSE?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-08 15:36           ` Amir Goldstein
@ 2022-09-09 19:07             ` Miklos Szeredi
  2022-09-10  8:52               ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2022-09-09 19:07 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Thu, 8 Sept 2022 at 17:36, Amir Goldstein <amir73il@gmail.com> wrote:
>
> Hi Alessio and Miklos,
>
> Some time has passed.. and I was thinking of picking up these patches.
>
> On Mon, Mar 1, 2021 at 7:05 PM Alessio Balsini <balsini@android.com> wrote:
> >
> > On Fri, Feb 19, 2021 at 09:40:21AM +0100, Miklos Szeredi wrote:
> > > On Fri, Feb 19, 2021 at 8:05 AM Peng Tao <bergwolf@gmail.com> wrote:
> > > >
> > > > On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > > > What I think would be useful is to have an explicit
> > > > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> > > > > once the fuse server no longer needs this ID.   If this turns out to
> > > > > be a performance problem, we could still add the auto-close behavior
> > > > > with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
> > > > Hi Miklos,
> > > >
> > > > W/o auto closing, what happens if user space daemon forgets to call
> > > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?
> > >
> > > Kernel would keep the ID open until explicit close or fuse connection
> > > is released.
> > >
> > > There should be some limit on the max open files referenced through
> > > ID's, though.   E.g. inherit RLIMIT_NOFILE from mounting task.
> > >
> > > Thanks,
> > > Miklos
> >
> > I like the idea of FUSE_DEV_IOC_PASSTHROUGH_CLOSE to revoke the
> > passthrough access, that is something I was already working on. What I
> > had in mind was simply to break that 1:1 connection between fuse_file
> > and lower filp setting a specific fuse_file::passthrough::filp to NULL,
> > but this is slightly different from what you mentioned.
> >
>
> I don't like the idea of switching between passthrough and server mid-life
> of an open file.
>
> There are consequences related to syncing the attribute cache of the kernel
> and the server that I don't even want to think about.
>
> > AFAIU you are suggesting to allocate one ID for each lower fs file
> > opened with passthrough within a connection, and maybe using idr_find at
> > every read/write/mmap operation to check if passthrough is enabled on
> > that file. Something similar to fuse2_map_get().
> > This way the fuse server can pass the same ID to one or more
> > fuse_file(s).
> > FUSE_DEV_IOC_PASSTHROUGH_CLOSE would idr_remove the ID, so idr_find
> > would fail, preventing the use of passthrough on that ID. CMIIW.
> >
>
> I don't think that FUSE_DEV_IOC_PASSTHROUGH_CLOSE should remove the ID.
> We can use a refcount for the mapping and FUSE_DEV_IOC_PASSTHROUGH_CLOSE
> just drops the initial server's refcount.
>
> Implementing revoke for an existing mapping is something completely different.
> It can be done, not even so hard, but I don't think it should be part of this
> series and in any case revoke will not remove the ID.
>
> > After FUSE_DEV_IOC_PASSTHROUGH_CLOSE(ID) it may happen that if some
> > fuse_file(s) storing that ID are still open and the same ID is reclaimed
> > in a new idr_alloc, this would lead to mismatching lower fs filp being
> > used by our fuse_file(s).  So also the ID stored in the fuse_file(s)
> > must be invalidated to prevent future uses of deallocated IDs.
>
> Obtaining a refcount on FOPEN_PASSTHROUGH will solve that.
>
> >
> > Would it make sense to have a list of fuse_files using the same ID, that
> > must be traversed at FUSE_DEV_IOC_PASSTHROUGH_CLOSE time?
> > Negative values (maybe -ENOENT) might be used to mark IDs as invalid,
> > and tested before idr_find at read/write/mmap to avoid the idr_find
> > complexity in case passthrough is disabled for that file.
> >
> > What do you think?
> >
>
> As I wrote above, this sounds unnecessarily complicated.
>
> Miklos,
>
> Do you agree with my interpretation of
> FUSE_DEV_IOC_PASSTHROUGH_CLOSE?

We need to deal with the case of too many open files.   The server
could manage this, but then we do need to handle the case when a
cached mapping disappears, i.e:

 client opens file
 [time passes]
 cached passthrough fd gets evicted to make room for other passthrough I/O
 [time passes]
 new I/O request comes in
 need to reestablish passthrough fd before finishing I/O

The way I think of this is that a passthrough mapping is assigned at
open time, which is cached (which may have the lifetime longer than
the open file, but shorter as well).  When
FUSE_DEV_IOC_PASSTHROUGH_CLOSE and there are cached mapping referring
to this particular handle, then those mappings need to be purged.   On
a new I/O request, the mapping will need to be reestablished by
sending a FUSE_MAP request, which triggers
FUSE_DEV_IOC_PASSTHROUGH_OPEN.

One other question that's nagging me is how to "unhide" these pseudo-fds.

Could we create a kernel thread for each fuse sb which has normal
file-table for these?  This would would allow inspecting state through
/proc/$KTHREDID/fd, etc..

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-09 19:07             ` Miklos Szeredi
@ 2022-09-10  8:52               ` Amir Goldstein
  2022-09-10 13:03                 ` Bernd Schubert
  2022-09-12  9:29                 ` Miklos Szeredi
  0 siblings, 2 replies; 66+ messages in thread
From: Amir Goldstein @ 2022-09-10  8:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Fri, Sep 9, 2022 at 10:07 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, 8 Sept 2022 at 17:36, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Hi Alessio and Miklos,
> >
> > Some time has passed.. and I was thinking of picking up these patches.
> >
> > On Mon, Mar 1, 2021 at 7:05 PM Alessio Balsini <balsini@android.com> wrote:
> > >
> > > On Fri, Feb 19, 2021 at 09:40:21AM +0100, Miklos Szeredi wrote:
> > > > On Fri, Feb 19, 2021 at 8:05 AM Peng Tao <bergwolf@gmail.com> wrote:
> > > > >
> > > > > On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > > > What I think would be useful is to have an explicit
> > > > > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> > > > > > once the fuse server no longer needs this ID.   If this turns out to
> > > > > > be a performance problem, we could still add the auto-close behavior
> > > > > > with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
> > > > > Hi Miklos,
> > > > >
> > > > > W/o auto closing, what happens if user space daemon forgets to call
> > > > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?
> > > >
> > > > Kernel would keep the ID open until explicit close or fuse connection
> > > > is released.
> > > >
> > > > There should be some limit on the max open files referenced through
> > > > ID's, though.   E.g. inherit RLIMIT_NOFILE from mounting task.
> > > >
> > > > Thanks,
> > > > Miklos
> > >
> > > I like the idea of FUSE_DEV_IOC_PASSTHROUGH_CLOSE to revoke the
> > > passthrough access, that is something I was already working on. What I
> > > had in mind was simply to break that 1:1 connection between fuse_file
> > > and lower filp setting a specific fuse_file::passthrough::filp to NULL,
> > > but this is slightly different from what you mentioned.
> > >
> >
> > I don't like the idea of switching between passthrough and server mid-life
> > of an open file.
> >
> > There are consequences related to syncing the attribute cache of the kernel
> > and the server that I don't even want to think about.
> >
> > > AFAIU you are suggesting to allocate one ID for each lower fs file
> > > opened with passthrough within a connection, and maybe using idr_find at
> > > every read/write/mmap operation to check if passthrough is enabled on
> > > that file. Something similar to fuse2_map_get().
> > > This way the fuse server can pass the same ID to one or more
> > > fuse_file(s).
> > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE would idr_remove the ID, so idr_find
> > > would fail, preventing the use of passthrough on that ID. CMIIW.
> > >
> >
> > I don't think that FUSE_DEV_IOC_PASSTHROUGH_CLOSE should remove the ID.
> > We can use a refcount for the mapping and FUSE_DEV_IOC_PASSTHROUGH_CLOSE
> > just drops the initial server's refcount.
> >
> > Implementing revoke for an existing mapping is something completely different.
> > It can be done, not even so hard, but I don't think it should be part of this
> > series and in any case revoke will not remove the ID.
> >
> > > After FUSE_DEV_IOC_PASSTHROUGH_CLOSE(ID) it may happen that if some
> > > fuse_file(s) storing that ID are still open and the same ID is reclaimed
> > > in a new idr_alloc, this would lead to mismatching lower fs filp being
> > > used by our fuse_file(s).  So also the ID stored in the fuse_file(s)
> > > must be invalidated to prevent future uses of deallocated IDs.
> >
> > Obtaining a refcount on FOPEN_PASSTHROUGH will solve that.
> >
> > >
> > > Would it make sense to have a list of fuse_files using the same ID, that
> > > must be traversed at FUSE_DEV_IOC_PASSTHROUGH_CLOSE time?
> > > Negative values (maybe -ENOENT) might be used to mark IDs as invalid,
> > > and tested before idr_find at read/write/mmap to avoid the idr_find
> > > complexity in case passthrough is disabled for that file.
> > >
> > > What do you think?
> > >
> >
> > As I wrote above, this sounds unnecessarily complicated.
> >
> > Miklos,
> >
> > Do you agree with my interpretation of
> > FUSE_DEV_IOC_PASSTHROUGH_CLOSE?
>
> We need to deal with the case of too many open files.   The server
> could manage this, but then we do need to handle the case when a
> cached mapping disappears, i.e:
>
>  client opens file
>  [time passes]
>  cached passthrough fd gets evicted to make room for other passthrough I/O
>  [time passes]
>  new I/O request comes in
>  need to reestablish passthrough fd before finishing I/O
>
> The way I think of this is that a passthrough mapping is assigned at
> open time, which is cached (which may have the lifetime longer than
> the open file, but shorter as well).  When
> FUSE_DEV_IOC_PASSTHROUGH_CLOSE and there are cached mapping referring
> to this particular handle, then those mappings need to be purged.   On
> a new I/O request, the mapping will need to be reestablished by
> sending a FUSE_MAP request, which triggers
> FUSE_DEV_IOC_PASSTHROUGH_OPEN.
>

Do we really need all this complication?

I mean, if we do this then the server may end up thrashing this
passthrough cache
when the client has many open files.

I think we should accept the fact that just as any current FUSE
passthrough (in userspace) implementation is limited to max number of
open files as the server's process limitation, kernel passthrough implementation
will be limited by inheriting the mounter's process limitation.

There is no reason that the server should need to keep more
passthrough fd's open than client open fds.

If we only support FOPEN_PASSTHROUGH_AUTOCLOSE as v12
patches implicitly do, then the memory overhead is not much different
than the extra overlayfs pseudo realfiles.

So IMO, we can start with a refcounted mapping implementation
and only if there is interest in server managed mappings eviction
we could implement FUSE_DEV_IOC_PASSTHROUGH_FORGET.

> One other question that's nagging me is how to "unhide" these pseudo-fds.
>
> Could we create a kernel thread for each fuse sb which has normal
> file-table for these?  This would would allow inspecting state through
> /proc/$KTHREDID/fd, etc..
>

Yeah that sounds like a good idea.
As I mentioned elsewhere in the thread, io_uring also has a mechanism
to register open files with the kernel to perform IO on them later.
I assume those files are also visible via some /proc/$KTHREDID/fd,
but I'll need to check.

BTW, I see that the Android team is presenting eBPF-FUSE on LPC
coming Tuesday [1].

There are affordable and free options to attend virtually [2].

I wonder when patches will be available ;)

Thanks,
Amir.

[1] https://lpc.events/event/16/contributions/1339/
[2] https://lpc.events/event/16/page/181-attend

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-10  8:52               ` Amir Goldstein
@ 2022-09-10 13:03                 ` Bernd Schubert
  2022-09-12  9:29                 ` Miklos Szeredi
  1 sibling, 0 replies; 66+ messages in thread
From: Bernd Schubert @ 2022-09-10 13:03 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel



> 
> BTW, I see that the Android team is presenting eBPF-FUSE on LPC
> coming Tuesday [1].
> 
> There are affordable and free options to attend virtually [2].
> 
> I wonder when patches will be available ;)

Oh interesting. Btw, I'm currently working on uring communication, 
similar to what ublk is doing, I hope to have a very first version by 
next week.

Bernd

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-10  8:52               ` Amir Goldstein
  2022-09-10 13:03                 ` Bernd Schubert
@ 2022-09-12  9:29                 ` Miklos Szeredi
  2022-09-12 12:29                   ` Amir Goldstein
  1 sibling, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2022-09-12  9:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Sat, 10 Sept 2022 at 10:52, Amir Goldstein <amir73il@gmail.com> wrote:

> I think we should accept the fact that just as any current FUSE
> passthrough (in userspace) implementation is limited to max number of
> open files as the server's process limitation, kernel passthrough implementation
> will be limited by inheriting the mounter's process limitation.
>
> There is no reason that the server should need to keep more
> passthrough fd's open than client open fds.

Maybe you're right.

> If we only support FOPEN_PASSTHROUGH_AUTOCLOSE as v12
> patches implicitly do, then the memory overhead is not much different
> than the extra overlayfs pseudo realfiles.

How exactly would this work?

ioctl(F_D_I_P_OPEN) - create passthrough fd with ref 1
open/FOPEN_PASSTHOUGH -  inc refcount in passthrough fd
release - put refcount in passthrough fd
ioctl(F_D_I_P_CLOSE) - put ref in passthrough fd

Due to being refcounted the F_D_I_P_CLOSE can come at any point past
the finished open request.

Or did you have something else in mind?

> > One other question that's nagging me is how to "unhide" these pseudo-fds.
> >
> > Could we create a kernel thread for each fuse sb which has normal
> > file-table for these?  This would would allow inspecting state through
> > /proc/$KTHREDID/fd, etc..
> >
>
> Yeah that sounds like a good idea.
> As I mentioned elsewhere in the thread, io_uring also has a mechanism
> to register open files with the kernel to perform IO on them later.
> I assume those files are also visible via some /proc/$KTHREDID/fd,
> but I'll need to check.
>
> BTW, I see that the Android team is presenting eBPF-FUSE on LPC
> coming Tuesday [1].

At first glance it looks like a filtered kernel-only passthrough +
fuse fallback, where filtering is provided by eBPF scripts and only
falls back to userspace access on more complex cases.  Maybe it's a
good direction, we'll see.  Apparently the passthrough case is
important enough for various use cases.

Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-12  9:29                 ` Miklos Szeredi
@ 2022-09-12 12:29                   ` Amir Goldstein
  2022-09-12 13:03                     ` Miklos Szeredi
  2023-05-12 19:37                     ` [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough Amir Goldstein
  0 siblings, 2 replies; 66+ messages in thread
From: Amir Goldstein @ 2022-09-12 12:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Mon, Sep 12, 2022 at 12:29 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Sat, 10 Sept 2022 at 10:52, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I think we should accept the fact that just as any current FUSE
> > passthrough (in userspace) implementation is limited to max number of
> > open files as the server's process limitation, kernel passthrough implementation
> > will be limited by inheriting the mounter's process limitation.
> >
> > There is no reason that the server should need to keep more
> > passthrough fd's open than client open fds.
>
> Maybe you're right.
>
> > If we only support FOPEN_PASSTHROUGH_AUTOCLOSE as v12
> > patches implicitly do, then the memory overhead is not much different
> > than the extra overlayfs pseudo realfiles.
>
> How exactly would this work?
>
> ioctl(F_D_I_P_OPEN) - create passthrough fd with ref 1
> open/FOPEN_PASSTHOUGH -  inc refcount in passthrough fd
> release - put refcount in passthrough fd
> ioctl(F_D_I_P_CLOSE) - put ref in passthrough fd
>
> Due to being refcounted the F_D_I_P_CLOSE can come at any point past
> the finished open request.
>
> Or did you have something else in mind?
>

What I had in mind is that FOPEN_PASSTHROUGH_AUTOCLOSE
"transfers" the server's refcount to the kernel and server does
not need to call explicit F_D_I_P_CLOSE.

This is useful for servers that don't care about reusing mappings.

> > > One other question that's nagging me is how to "unhide" these pseudo-fds.
> > >
> > > Could we create a kernel thread for each fuse sb which has normal
> > > file-table for these?  This would would allow inspecting state through
> > > /proc/$KTHREDID/fd, etc..
> > >
> >
> > Yeah that sounds like a good idea.
> > As I mentioned elsewhere in the thread, io_uring also has a mechanism
> > to register open files with the kernel to perform IO on them later.
> > I assume those files are also visible via some /proc/$KTHREDID/fd,
> > but I'll need to check.
> >
> > BTW, I see that the Android team is presenting eBPF-FUSE on LPC
> > coming Tuesday [1].
>
> At first glance it looks like a filtered kernel-only passthrough +
> fuse fallback, where filtering is provided by eBPF scripts and only
> falls back to userspace access on more complex cases.  Maybe it's a
> good direction, we'll see.

Yeh, we'll see.

> Apparently the passthrough case is
> important enough for various use cases.
>

Indeed.
My use case is HSM and I think that using FUSE for HSM is becoming
more and more common these days.

One of the things that bothers me is that both this FUSE_PASSTHROUGH
patch set and any future eBPF-FUSE passthrough implementation is
bound to duplicate a lot of code and know how from overlayfs
(along with the bugs).

We could try to factor out some common bits to a kernel fs passthough
library.

Anotehr options to consider is not to add any passthrough logic
to FUSE at all.

Instead, implement a "switch" fs to choose between passthrough
to one of several underlying fs "branches", where one of the branches
could be local fs and another a FUSE fs (i.e. for the complex cases).

A similar design was described at:
https://github.com/github/libprojfs/blob/master/docs/design.md#phase-2--hybrid

This "switch" fs is not that much different from overlayfs, when
removing the "merge dir" logic and replacing the "is_upper" logic
with a generic eBPF "choose_branch" logic.

Food for thought.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-12 12:29                   ` Amir Goldstein
@ 2022-09-12 13:03                     ` Miklos Szeredi
  2022-09-12 13:05                       ` Miklos Szeredi
  2022-09-12 13:26                       ` Amir Goldstein
  2023-05-12 19:37                     ` [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough Amir Goldstein
  1 sibling, 2 replies; 66+ messages in thread
From: Miklos Szeredi @ 2022-09-12 13:03 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Mon, 12 Sept 2022 at 14:29, Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Sep 12, 2022 at 12:29 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Sat, 10 Sept 2022 at 10:52, Amir Goldstein <amir73il@gmail.com> wrote:

> > > BTW, I see that the Android team is presenting eBPF-FUSE on LPC
> > > coming Tuesday [1].
> >
> > At first glance it looks like a filtered kernel-only passthrough +
> > fuse fallback, where filtering is provided by eBPF scripts and only
> > falls back to userspace access on more complex cases.  Maybe it's a
> > good direction, we'll see.
>
> Yeh, we'll see.
>
> > Apparently the passthrough case is
> > important enough for various use cases.
> >
>
> Indeed.
> My use case is HSM and I think that using FUSE for HSM is becoming
> more and more common these days.

HSM?

>
> One of the things that bothers me is that both this FUSE_PASSTHROUGH
> patch set and any future eBPF-FUSE passthrough implementation is
> bound to duplicate a lot of code and know how from overlayfs
> (along with the bugs).
>
> We could try to factor out some common bits to a kernel fs passthough
> library.

Yeah, although fuse/passthrough might not want all the complexity.
Getting rid of the context switch latency is the easy part.  Getting
rid of dcache duplication is the hard one, though it seems that the
current level of hacks in overlayfs seems sufficient and nobody much
cares for the corner cases (or works around them).

>
> Anotehr options to consider is not to add any passthrough logic
> to FUSE at all.
>
> Instead, implement a "switch" fs to choose between passthrough
> to one of several underlying fs "branches", where one of the branches
> could be local fs and another a FUSE fs (i.e. for the complex cases).

st_dev/st_ino management might become a headache (as it is in overlayfs).

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-12 13:03                     ` Miklos Szeredi
@ 2022-09-12 13:05                       ` Miklos Szeredi
  2022-09-12 13:26                       ` Amir Goldstein
  1 sibling, 0 replies; 66+ messages in thread
From: Miklos Szeredi @ 2022-09-12 13:05 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Mon, 12 Sept 2022 at 15:03, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 12 Sept 2022 at 14:29, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Sep 12, 2022 at 12:29 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Sat, 10 Sept 2022 at 10:52, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > BTW, I see that the Android team is presenting eBPF-FUSE on LPC
> > > > coming Tuesday [1].
> > >
> > > At first glance it looks like a filtered kernel-only passthrough +
> > > fuse fallback, where filtering is provided by eBPF scripts and only
> > > falls back to userspace access on more complex cases.  Maybe it's a
> > > good direction, we'll see.
> >
> > Yeh, we'll see.
> >
> > > Apparently the passthrough case is
> > > important enough for various use cases.
> > >
> >
> > Indeed.
> > My use case is HSM and I think that using FUSE for HSM is becoming
> > more and more common these days.
>
> HSM?
>
> >
> > One of the things that bothers me is that both this FUSE_PASSTHROUGH
> > patch set and any future eBPF-FUSE passthrough implementation is
> > bound to duplicate a lot of code and know how from overlayfs
> > (along with the bugs).
> >
> > We could try to factor out some common bits to a kernel fs passthough
> > library.
>
> Yeah, although fuse/passthrough might not want all the complexity.
> Getting rid of the context switch latency is the easy part.  Getting
> rid of dcache duplication is the hard one, though it seems that the

s/dcache/page cache/

> current level of hacks in overlayfs seems sufficient and nobody much
> cares for the corner cases (or works around them).
>
> >
> > Anotehr options to consider is not to add any passthrough logic
> > to FUSE at all.
> >
> > Instead, implement a "switch" fs to choose between passthrough
> > to one of several underlying fs "branches", where one of the branches
> > could be local fs and another a FUSE fs (i.e. for the complex cases).
>
> st_dev/st_ino management might become a headache (as it is in overlayfs).
>
> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-12 13:03                     ` Miklos Szeredi
  2022-09-12 13:05                       ` Miklos Szeredi
@ 2022-09-12 13:26                       ` Amir Goldstein
  2022-09-12 14:22                         ` Miklos Szeredi
  1 sibling, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2022-09-12 13:26 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Mon, Sep 12, 2022 at 4:03 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 12 Sept 2022 at 14:29, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Sep 12, 2022 at 12:29 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Sat, 10 Sept 2022 at 10:52, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > BTW, I see that the Android team is presenting eBPF-FUSE on LPC
> > > > coming Tuesday [1].
> > >
> > > At first glance it looks like a filtered kernel-only passthrough +
> > > fuse fallback, where filtering is provided by eBPF scripts and only
> > > falls back to userspace access on more complex cases.  Maybe it's a
> > > good direction, we'll see.
> >
> > Yeh, we'll see.
> >
> > > Apparently the passthrough case is
> > > important enough for various use cases.
> > >
> >
> > Indeed.
> > My use case is HSM and I think that using FUSE for HSM is becoming
> > more and more common these days.
>
> HSM?
>

Sorry, Hierarchical Storage Management.
such as the product described at:
https://github.com/github/libprojfs/blob/master/docs/design.md#vfsforgit-on-windows

> >
> > One of the things that bothers me is that both this FUSE_PASSTHROUGH
> > patch set and any future eBPF-FUSE passthrough implementation is
> > bound to duplicate a lot of code and know how from overlayfs
> > (along with the bugs).
> >
> > We could try to factor out some common bits to a kernel fs passthough
> > library.
>
> Yeah, although fuse/passthrough might not want all the complexity.
> Getting rid of the context switch latency is the easy part.  Getting
> rid of  page cache duplication is the hard one, though it seems that the
> current level of hacks in overlayfs seems sufficient and nobody much
> cares for the corner cases (or works around them).
>

FWIW duplicate page cache exists in passthough FUSE whether
passthrough is in kernel or in userspace, but going through yet another
"switch" fs would make things even worse.

I have another completely different solution that I am considering
for HSM that is a little less flexible than FUSE, but does not have many of
the passthrough challenges:

https://lore.kernel.org/linux-fsdevel/CAOQ4uxhrQ7hySTyHM0Atq=uzbNdHyGV5wfadJarhAu1jDFOUTg@mail.gmail.com/

> >
> > Anotehr options to consider is not to add any passthrough logic
> > to FUSE at all.
> >
> > Instead, implement a "switch" fs to choose between passthrough
> > to one of several underlying fs "branches", where one of the branches
> > could be local fs and another a FUSE fs (i.e. for the complex cases).
>
> st_dev/st_ino management might become a headache (as it is in overlayfs).
>

Yeh. It's interesting how passthough of readdir and lookup/create in
eBPF-FUSE is going to handle those things...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-12 13:26                       ` Amir Goldstein
@ 2022-09-12 14:22                         ` Miklos Szeredi
  2022-09-12 15:39                           ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2022-09-12 14:22 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Mon, 12 Sept 2022 at 15:26, Amir Goldstein <amir73il@gmail.com> wrote:

> FWIW duplicate page cache exists in passthough FUSE whether
> passthrough is in kernel or in userspace, but going through yet another
> "switch" fs would make things even worse.

I imagine the "switch" layer for a HSM would be simple enough:

a) if file exists on fastest layer (upper) then take that
b) if not then fall back to fuse layer (lower) .

It's almost like a read-only overlayfs (no copy up) except it would be
read-write and copy-up/down would be performed by the server as
needed. No page cache duplication for upper, and AFAICS no corner
cases that overlayfs has, since all layers are consistent (the fuse
layer would reference the upper if that is currently the up-to-date
one).

readdir would go to the layer which has the complete directory (which
I guess the lower one must always have, but the upper could also).

I'm probably missing lots of details, though...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-12 14:22                         ` Miklos Szeredi
@ 2022-09-12 15:39                           ` Amir Goldstein
  2022-09-12 17:43                             ` Hao Luo
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2022-09-12 15:39 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel,
	Hao Luo

On Mon, Sep 12, 2022 at 5:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 12 Sept 2022 at 15:26, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > FWIW duplicate page cache exists in passthough FUSE whether
> > passthrough is in kernel or in userspace, but going through yet another
> > "switch" fs would make things even worse.
>
> I imagine the "switch" layer for a HSM would be simple enough:
>
> a) if file exists on fastest layer (upper) then take that
> b) if not then fall back to fuse layer (lower) .
>
> It's almost like a read-only overlayfs (no copy up) except it would be
> read-write and copy-up/down would be performed by the server as
> needed. No page cache duplication for upper, and AFAICS no corner
> cases that overlayfs has, since all layers are consistent (the fuse
> layer would reference the upper if that is currently the up-to-date
> one).

On recent LSF/MM/BPF, BPF developers asked me about using overlayfs
for something that looks like the above - merging of non overlapping layers
without any copy-up/down, but with write to lower.

I gave them the same solution (overlayfs without copy-up)
but I said I didn't know what you would think about this overlayfs mode
and I also pointed them to the eBPF-FUSE developers as another
possible solution to their use case.

>
> readdir would go to the layer which has the complete directory (which
> I guess the lower one must always have, but the upper could also).
>
> I'm probably missing lots of details, though...
>

That's what I said too :)

Does that mean that you are open to seeing patches for
an overlayfs mode that does not copy-up on write to lower?
I can come up with some semantics for readdir that will
make sense.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-12 15:39                           ` Amir Goldstein
@ 2022-09-12 17:43                             ` Hao Luo
  2022-09-12 18:28                               ` Overlayfs with writable lower layer Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Hao Luo @ 2022-09-12 17:43 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Alessio Balsini, Peng Tao, Akilesh Kailash,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel

Sorry, resend, my response was bounced back by mail system due to not
using plain text.

On Mon, Sep 12, 2022 at 8:40 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Sep 12, 2022 at 5:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, 12 Sept 2022 at 15:26, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > FWIW duplicate page cache exists in passthough FUSE whether
> > > passthrough is in kernel or in userspace, but going through yet another
> > > "switch" fs would make things even worse.
> >
> > I imagine the "switch" layer for a HSM would be simple enough:
> >
> > a) if file exists on fastest layer (upper) then take that
> > b) if not then fall back to fuse layer (lower) .
> >
> > It's almost like a read-only overlayfs (no copy up) except it would be
> > read-write and copy-up/down would be performed by the server as
> > needed. No page cache duplication for upper, and AFAICS no corner
> > cases that overlayfs has, since all layers are consistent (the fuse
> > layer would reference the upper if that is currently the up-to-date
> > one).
>
> On recent LSF/MM/BPF, BPF developers asked me about using overlayfs
> for something that looks like the above - merging of non overlapping layers
> without any copy-up/down, but with write to lower.
>
> I gave them the same solution (overlayfs without copy-up)
> but I said I didn't know what you would think about this overlayfs mode
> and I also pointed them to the eBPF-FUSE developers as another
> possible solution to their use case.
>

Thanks Amir for adding me in the thread. This idea is very useful for
BPF use cases.

A bit more context here: we were thinking of overlaying two
filesystems together to create a view that extends the filesystem at
the lower layer. In our design, the lower layer is a pseudo
filesystem, which one can _not_ create/delete files, but make
directories _indirectly_, via creating a kernel object; the upper is
bpf filesystem, from which, one can create files. The file's purpose
is to describe the directory in the lower layer, that is, to describe
the kernel object that directory corresponds to.

With the flexibility brought by BPF, it can be a quite flexible
solution to query kernel objects' states.

>
> >
> > readdir would go to the layer which has the complete directory (which
> > I guess the lower one must always have, but the upper could also).
> >
> > I'm probably missing lots of details, though...
> >
>
> That's what I said too :)
>
> Does that mean that you are open to seeing patches for
> an overlayfs mode that does not copy-up on write to lower?
> I can come up with some semantics for readdir that will
> make sense.

I am excited about this. Thanks Amir!

>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-12 17:43                             ` Hao Luo
@ 2022-09-12 18:28                               ` Amir Goldstein
  2022-09-13 18:26                                 ` Hao Luo
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2022-09-12 18:28 UTC (permalink / raw)
  To: Hao Luo; +Cc: Miklos Szeredi, linux-fsdevel

[Changing subject, reduce CC list and add move to overlayfs list]

On Mon, Sep 12, 2022 at 8:43 PM Hao Luo <haoluo@google.com> wrote:
>
> Sorry, resend, my response was bounced back by mail system due to not
> using plain text.
>
> On Mon, Sep 12, 2022 at 8:40 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Sep 12, 2022 at 5:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > I imagine the "switch" layer for a HSM would be simple enough:
> > >
> > > a) if file exists on fastest layer (upper) then take that
> > > b) if not then fall back to fuse layer (lower) .
> > >
> > > It's almost like a read-only overlayfs (no copy up) except it would be
> > > read-write and copy-up/down would be performed by the server as
> > > needed. No page cache duplication for upper, and AFAICS no corner
> > > cases that overlayfs has, since all layers are consistent (the fuse
> > > layer would reference the upper if that is currently the up-to-date
> > > one).
> >
> > On recent LSF/MM/BPF, BPF developers asked me about using overlayfs
> > for something that looks like the above - merging of non overlapping layers
> > without any copy-up/down, but with write to lower.
> >
> > I gave them the same solution (overlayfs without copy-up)
> > but I said I didn't know what you would think about this overlayfs mode
> > and I also pointed them to the eBPF-FUSE developers as another
> > possible solution to their use case.
> >
>
> Thanks Amir for adding me in the thread. This idea is very useful for
> BPF use cases.

Hi Hao,

Thanks for chiming in.
This thread has long diverged from the FUSE_PASSTHROUGH
patch review so I started a new thread to discuss the overlayfs
option.

Am I to understand that the eBPF-FUSE option currently
does not fit your needs (maybe because it is not ready yet)?

>
> A bit more context here: we were thinking of overlaying two
> filesystems together to create a view that extends the filesystem at
> the lower layer. In our design, the lower layer is a pseudo
> filesystem, which one can _not_ create/delete files, but make
> directories _indirectly_, via creating a kernel object; the upper is
> bpf filesystem, from which, one can create files. The file's purpose
> is to describe the directory in the lower layer, that is, to describe
> the kernel object that directory corresponds to.
>
> With the flexibility brought by BPF, it can be a quite flexible
> solution to query kernel objects' states.
>

Can't say I was able to understand the description, but let me
try to write the requirement in overlayfs terminology.
Please correct me if I am wrong.

1. The "lower" fs (cgroupfs?) is a "remote" fs where directories
    may appear or disappear due to "remote" events

I think there were similar requirements to support changes
to lower fs which in a network fs in the past.
As long as those directories are strictly lower that should be
possible.

Does upper fs have directories of the same name that need to
be merged with those lower dirs?

2. You did not mention this but IIRC, the lower fs has (pseudo)
    files that you want to be able to write to, and those files never
    exist in the upper fs

That translates to allowing write to lower fs without triggering copy-up

3. The upper fs files are all "pure" - that means those file paths
    do not exist in lower fs

Creating files in the upper fs is normal in overlayfs, so not a problem.

4. Merging upper and lower directory content is usually not needed??

It must be needed for the root dir at least but where else?
In which directories are the upper files created?
Which directories are "pure" (containing either lower or upper files)
and which directories are "merge" of upper and lower children?

If the answer is that some directories exist both in upper and in lower
and that those directories should be merged for the unified view then
this is standard overlayfs behavior.

Once I get the requirement in order I will try to write coherent
semantics for this mode and see if it can meet the needs of other
projects, such as HSM with FUSE as lower fs.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-12 18:28                               ` Overlayfs with writable lower layer Amir Goldstein
@ 2022-09-13 18:26                                 ` Hao Luo
  2022-09-13 18:54                                   ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Hao Luo @ 2022-09-13 18:26 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, linux-fsdevel

On Mon, Sep 12, 2022 at 11:28 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Sep 12, 2022 at 8:43 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Mon, Sep 12, 2022 at 8:40 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Mon, Sep 12, 2022 at 5:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > I imagine the "switch" layer for a HSM would be simple enough:
> > > >
> > > > a) if file exists on fastest layer (upper) then take that
> > > > b) if not then fall back to fuse layer (lower) .
> > > >
> > > > It's almost like a read-only overlayfs (no copy up) except it would be
> > > > read-write and copy-up/down would be performed by the server as
> > > > needed. No page cache duplication for upper, and AFAICS no corner
> > > > cases that overlayfs has, since all layers are consistent (the fuse
> > > > layer would reference the upper if that is currently the up-to-date
> > > > one).
> > >
> > > On recent LSF/MM/BPF, BPF developers asked me about using overlayfs
> > > for something that looks like the above - merging of non overlapping layers
> > > without any copy-up/down, but with write to lower.
> > >
> > > I gave them the same solution (overlayfs without copy-up)
> > > but I said I didn't know what you would think about this overlayfs mode
> > > and I also pointed them to the eBPF-FUSE developers as another
> > > possible solution to their use case.
> > >
> >
> > Thanks Amir for adding me in the thread. This idea is very useful for
> > BPF use cases.
>
[...]
>
> Am I to understand that the eBPF-FUSE option currently
> does not fit your needs (maybe because it is not ready yet)?
>

Yeah, mostly because eBPF-FUSE is not ready. I talked to Alessio and
his colleague after LSF/MM/BPF. They were distracted from eBPF-FUSE
development at that time and I didn't follow up, working on other BPF
stuff.

> >
[...]
> Can't say I was able to understand the description, but let me
> try to write the requirement in overlayfs terminology.
> Please correct me if I am wrong.
>
> 1. The "lower" fs (cgroupfs?) is a "remote" fs where directories
>     may appear or disappear due to "remote" events
>

Right. Seems we are aligned on this.

> I think there were similar requirements to support changes
> to lower fs which in a network fs in the past.
> As long as those directories are strictly lower that should be
> possible.
>
> Does upper fs have directories of the same name that need to
> be merged with those lower dirs?
>

No, I don't think so. Upper fs should only have files in my use case.

> 2. You did not mention this but IIRC, the lower fs has (pseudo)
>     files that you want to be able to write to, and those files never
>     exist in the upper fs
>
> That translates to allowing write to lower fs without triggering copy-up
>

Writing to lower is not needed right now.

> 3. The upper fs files are all "pure" - that means those file paths
>     do not exist in lower fs
>
> Creating files in the upper fs is normal in overlayfs, so not a problem.
>

Yes. If used in an expected way, the lower fs won't have file paths
that also exist in the upper.

> 4. Merging upper and lower directory content is usually not needed??
>
> It must be needed for the root dir at least but where else?
> In which directories are the upper files created?
> Which directories are "pure" (containing either lower or upper files)
> and which directories are "merge" of upper and lower children?
>

In my use case, if that's doable, all the directories are "pure",
except the root dir, and they are in the lower. The files are either
from upper or from lower, so no merge. This should be sufficient for
the BPF use case.

> If the answer is that some directories exist both in upper and in lower
> and that those directories should be merged for the unified view then
> this is standard overlayfs behavior.
>
> Once I get the requirement in order I will try to write coherent
> semantics for this mode and see if it can meet the needs of other
> projects, such as HSM with FUSE as lower fs.

Thanks Amir. I want to clarify that with my very very limited
knowledge on FS, I can't say what I am thinking right now is
absolutely correct. Please take my request for features with a grain
of salt and feel free to pick the semantics that you see make most
sense.

Hao

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-13 18:26                                 ` Hao Luo
@ 2022-09-13 18:54                                   ` Amir Goldstein
  2022-09-13 20:33                                     ` Hao Luo
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2022-09-13 18:54 UTC (permalink / raw)
  To: Hao Luo; +Cc: Miklos Szeredi, linux-fsdevel

On Tue, Sep 13, 2022 at 9:26 PM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Sep 12, 2022 at 11:28 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Sep 12, 2022 at 8:43 PM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Mon, Sep 12, 2022 at 8:40 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > On Mon, Sep 12, 2022 at 5:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > >
> > > > > I imagine the "switch" layer for a HSM would be simple enough:
> > > > >
> > > > > a) if file exists on fastest layer (upper) then take that
> > > > > b) if not then fall back to fuse layer (lower) .
> > > > >
> > > > > It's almost like a read-only overlayfs (no copy up) except it would be
> > > > > read-write and copy-up/down would be performed by the server as
> > > > > needed. No page cache duplication for upper, and AFAICS no corner
> > > > > cases that overlayfs has, since all layers are consistent (the fuse
> > > > > layer would reference the upper if that is currently the up-to-date
> > > > > one).
> > > >
> > > > On recent LSF/MM/BPF, BPF developers asked me about using overlayfs
> > > > for something that looks like the above - merging of non overlapping layers
> > > > without any copy-up/down, but with write to lower.
> > > >
> > > > I gave them the same solution (overlayfs without copy-up)
> > > > but I said I didn't know what you would think about this overlayfs mode
> > > > and I also pointed them to the eBPF-FUSE developers as another
> > > > possible solution to their use case.
> > > >
> > >
> > > Thanks Amir for adding me in the thread. This idea is very useful for
> > > BPF use cases.
> >
> [...]
> >
> > Am I to understand that the eBPF-FUSE option currently
> > does not fit your needs (maybe because it is not ready yet)?
> >
>
> Yeah, mostly because eBPF-FUSE is not ready. I talked to Alessio and
> his colleague after LSF/MM/BPF. They were distracted from eBPF-FUSE
> development at that time and I didn't follow up, working on other BPF
> stuff.
>
> > >
> [...]
> > Can't say I was able to understand the description, but let me
> > try to write the requirement in overlayfs terminology.
> > Please correct me if I am wrong.
> >
> > 1. The "lower" fs (cgroupfs?) is a "remote" fs where directories
> >     may appear or disappear due to "remote" events
> >
>
> Right. Seems we are aligned on this.
>
> > I think there were similar requirements to support changes
> > to lower fs which in a network fs in the past.
> > As long as those directories are strictly lower that should be
> > possible.
> >
> > Does upper fs have directories of the same name that need to
> > be merged with those lower dirs?
> >
>
> No, I don't think so. Upper fs should only have files in my use case.
>
> > 2. You did not mention this but IIRC, the lower fs has (pseudo)
> >     files that you want to be able to write to, and those files never
> >     exist in the upper fs
> >
> > That translates to allowing write to lower fs without triggering copy-up
> >
>
> Writing to lower is not needed right now.
>
> > 3. The upper fs files are all "pure" - that means those file paths
> >     do not exist in lower fs
> >
> > Creating files in the upper fs is normal in overlayfs, so not a problem.
> >
>
> Yes. If used in an expected way, the lower fs won't have file paths
> that also exist in the upper.
>
> > 4. Merging upper and lower directory content is usually not needed??
> >
> > It must be needed for the root dir at least but where else?
> > In which directories are the upper files created?
> > Which directories are "pure" (containing either lower or upper files)
> > and which directories are "merge" of upper and lower children?
> >
>
> In my use case, if that's doable, all the directories are "pure",
> except the root dir, and they are in the lower. The files are either
> from upper or from lower, so no merge. This should be sufficient for
> the BPF use case.
>
> > If the answer is that some directories exist both in upper and in lower
> > and that those directories should be merged for the unified view then
> > this is standard overlayfs behavior.
> >
> > Once I get the requirement in order I will try to write coherent
> > semantics for this mode and see if it can meet the needs of other
> > projects, such as HSM with FUSE as lower fs.
>
> Thanks Amir. I want to clarify that with my very very limited
> knowledge on FS, I can't say what I am thinking right now is
> absolutely correct. Please take my request for features with a grain
> of salt and feel free to pick the semantics that you see make most
> sense.
>

OK. IIUC, you have upper fs files only in the root dir?
And the lower root dir has only subdirs?

If that is all then it sounds pretty simple.
It could be described something like this:
1. merged directories cannot appear/disappear
2. lower pure directories can appear/disappear
3. upper files/dirs can be created inside merge dirs and pure upper dirs

I think I have some patches that could help with #2.

Can you give a small example of an upper a lower and their
union trees just for the sake of discussion?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-13 18:54                                   ` Amir Goldstein
@ 2022-09-13 20:33                                     ` Hao Luo
  2022-09-14  3:46                                       ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Hao Luo @ 2022-09-13 20:33 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, linux-fsdevel

On Tue, Sep 13, 2022 at 11:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
> OK. IIUC, you have upper fs files only in the root dir?

Sorry, no, the upper fs files need to be in subdir.

> And the lower root dir has only subdirs?

There could be files.

> Can you give a small example of an upper a lower and their
> union trees just for the sake of discussion?
>

For example, assume lower has the following layout:
$ tree lower
.
└── A
    ├── B
    │   └── lower
    └── lower

I can't create files in the fs in the lower.
$ touch A/B/file
touch: cannot touch 'A/B/file': Permission denied

The upper is initially empty.

I would like to overlay a writable fs on top of lower, so the union
tree looks like
$ tree union
.
└── A
    ├── B
    │   └── lower
    └── lower
$ touch A/B/file
$ tree union
.
└── A
    ├── B
    │   ├── file
    │   └── lower2
    └── lower1

Here, 'file' exists in the upper.

Further, directory B could disappear from lower. When that happens, I
think there are two possible behaviors:
 - make 'file' disappear from union as well;
 - make 'file' and its directory accessible as well.

In behavior 1, it will look like
$ tree union
.
└── A
    └── lower1

In behavior 2, it will look like
$ tree union
.
└── A
    ├── B
    │   └── file
    └── lower1

IMHO, behavior 1 works better in my use case. But if the FS experts
think behavior 2 makes more sense, I can work around.

>
> If that is all then it sounds pretty simple.
> It could be described something like this:
> 1. merged directories cannot appear/disappear
> 2. lower pure directories can appear/disappear
> 3. upper files/dirs can be created inside merge dirs and pure upper dirs
>
> I think I have some patches that could help with #2.
>

These three semantics looks good to me.

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-13 20:33                                     ` Hao Luo
@ 2022-09-14  3:46                                       ` Amir Goldstein
  2022-09-14 18:00                                         ` Hao Luo
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2022-09-14  3:46 UTC (permalink / raw)
  To: Hao Luo; +Cc: Miklos Szeredi, linux-fsdevel, overlayfs, Christian Brauner

On Tue, Sep 13, 2022 at 11:33 PM Hao Luo <haoluo@google.com> wrote:
>
> On Tue, Sep 13, 2022 at 11:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > OK. IIUC, you have upper fs files only in the root dir?
>
> Sorry, no, the upper fs files need to be in subdir.
>
> > And the lower root dir has only subdirs?
>
> There could be files.
>

And assuming that those files are cgroupfs files, why
did you say there is no need to write to those files?

I seem to recall that was an important distinction from
standard overlayfs when you described the problem in LSFMM.

> > Can you give a small example of an upper a lower and their
> > union trees just for the sake of discussion?
> >
>
> For example, assume lower has the following layout:
> $ tree lower
> .
> └── A
>     ├── B
>     │   └── lower
>     └── lower
>
> I can't create files in the fs in the lower.
> $ touch A/B/file
> touch: cannot touch 'A/B/file': Permission denied
>
> The upper is initially empty.
>
> I would like to overlay a writable fs on top of lower, so the union
> tree looks like
> $ tree union
> .
> └── A
>     ├── B
>     │   └── lower
>     └── lower
> $ touch A/B/file
> $ tree union
> .
> └── A
>     ├── B
>     │   ├── file
>     │   └── lower2
>     └── lower1
>
> Here, 'file' exists in the upper.
>

So B is now called a "merged" dir - it is not a "pure" dir
anymore because it contains both upper and lower files.

Normally in overlayfs before creating 'file' in upper,
the hierarchy A/B/ needs to be created in upper fs
to contain the file.

Unless your upper fs automagically has the same
dirs hierarchy as the lower fs?

You should know that overlayfs does more than just
mkdir("A");mkdir("A/B")
it created tmp dirs, sets xattrs and attrs on them and moves
them into place.
I am not sure if you planned to support all those operations
in your upper fs?

There are probably some other limitations at the moment
related to pseudo filesystems that prevent them from being
used as upper and/or lower fs in overlayfs.

We will need to check what those limitations are and whether
those limitations could be lifted for your specific use case.

> Further, directory B could disappear from lower. When that happens, I
> think there are two possible behaviors:
>  - make 'file' disappear from union as well;
>  - make 'file' and its directory accessible as well.
>
> In behavior 1, it will look like
> $ tree union
> .
> └── A
>     └── lower1
>
> In behavior 2, it will look like
> $ tree union
> .
> └── A
>     ├── B
>     │   └── file
>     └── lower1
>
> IMHO, behavior 1 works better in my use case. But if the FS experts
> think behavior 2 makes more sense, I can work around.
>

Something that I always wanted to try is to get rid of the duplicated
upper fs hierarchy.

It's a bit complicated to explain the details, but if your use case
does not involve any directory renames(?), then the upper path
for the merge directories can be index based and not hierarchical.

IOW:
- union tree lookup/traversal by path is performed on the lower fs
- upper fs is traversed by name only for pure upper dirs
- merged dirs are found by index of lower dir inode

The result will be behavior 1 that you wanted

I have some other use cases that might benefit from this mode.

> >
> > If that is all then it sounds pretty simple.
> > It could be described something like this:
> > 1. merged directories cannot appear/disappear
> > 2. lower pure directories can appear/disappear
> > 3. upper files/dirs can be created inside merge dirs and pure upper dirs
> >
> > I think I have some patches that could help with #2.
> >
>
> These three semantics looks good to me.
>

Except the case about disappearing B that you just described
above breaks rule #1 ;)
so other semantics are needed.

I will need some more clarifications about your use case to
understand if what I have in mind could help your use case.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-14  3:46                                       ` Amir Goldstein
@ 2022-09-14 18:00                                         ` Hao Luo
  2022-09-14 19:23                                           ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Hao Luo @ 2022-09-14 18:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, linux-fsdevel, overlayfs, Christian Brauner

On Tue, Sep 13, 2022 at 8:46 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Sep 13, 2022 at 11:33 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Tue, Sep 13, 2022 at 11:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > OK. IIUC, you have upper fs files only in the root dir?
> >
> > Sorry, no, the upper fs files need to be in subdir.
> >
> > > And the lower root dir has only subdirs?
> >
> > There could be files.
> >
>
> And assuming that those files are cgroupfs files, why
> did you say there is no need to write to those files?
>
> I seem to recall that was an important distinction from
> standard overlayfs when you described the problem in LSFMM.

In my last reply, I was assuming all the writes to the cgroupfs files
happen from remote, not from the union. One can read files from union,
or create files that exist in the upper.

The idea is, I can provide two copies of lower to users. One is the
original lower, writable, so any update happens there. And the other
is a union of the lower and the upper, it's a read-only view of the
lower, but extended by the upper.

I actually don't know whether supporting writes to the lower from the
union is better. It probably is, because then I can combine the two
copies into one.

>
>
> > > Can you give a small example of an upper a lower and their
> > > union trees just for the sake of discussion?
> > >
> >
> > For example, assume lower has the following layout:
> > $ tree lower
> > .
> > └── A
> >     ├── B
> >     │   └── lower
> >     └── lower
> >
> > I can't create files in the fs in the lower.
> > $ touch A/B/file
> > touch: cannot touch 'A/B/file': Permission denied
> >
> > The upper is initially empty.
> >
> > I would like to overlay a writable fs on top of lower, so the union
> > tree looks like
> > $ tree union
> > .
> > └── A
> >     ├── B
> >     │   └── lower
> >     └── lower
> > $ touch A/B/file
> > $ tree union
> > .
> > └── A
> >     ├── B
> >     │   ├── file
> >     │   └── lower2
> >     └── lower1
> >
> > Here, 'file' exists in the upper.
> >
>
> So B is now called a "merged" dir - it is not a "pure" dir
> anymore because it contains both upper and lower files.
>
> Normally in overlayfs before creating 'file' in upper,
> the hierarchy A/B/ needs to be created in upper fs
> to contain the file.
>
> Unless your upper fs automagically has the same
> dirs hierarchy as the lower fs?
>
> You should know that overlayfs does more than just
> mkdir("A");mkdir("A/B")
> it created tmp dirs, sets xattrs and attrs on them and moves
> them into place.
> I am not sure if you planned to support all those operations
> in your upper fs?
>

Yeah. I can add support for tmp dirs, tmp files in my upper fs, that
is, bpffs. I played it a bit back in May, that is totally doable. I
remembered I successfully made bpffs accepted as overlayfs's upper
without xattrs and attrs back then. Maybe I missed something.

> There are probably some other limitations at the moment
> related to pseudo filesystems that prevent them from being
> used as upper and/or lower fs in overlayfs.
>
> We will need to check what those limitations are and whether
> those limitations could be lifted for your specific use case.
>

How can we approach this? Maybe I can send my patch that adds tmp dir,
tmp files and xattr, attr to upstream as RFC, so you can take a look?

> > Further, directory B could disappear from lower. When that happens, I
> > think there are two possible behaviors:
> >  - make 'file' disappear from union as well;
> >  - make 'file' and its directory accessible as well.
> >
> > In behavior 1, it will look like
> > $ tree union
> > .
> > └── A
> >     └── lower1
> >
> > In behavior 2, it will look like
> > $ tree union
> > .
> > └── A
> >     ├── B
> >     │   └── file
> >     └── lower1
> >
> > IMHO, behavior 1 works better in my use case. But if the FS experts
> > think behavior 2 makes more sense, I can work around.
> >
>
> Something that I always wanted to try is to get rid of the duplicated
> upper fs hierarchy.
>
> It's a bit complicated to explain the details, but if your use case
> does not involve any directory renames(?), then the upper path
> for the merge directories can be index based and not hierarchical.
>

Yeah, I don't expect directory renaming. But I can't say if there is
anyone trying to do that by accident, or by bad intention.

> IOW:
> - union tree lookup/traversal by path is performed on the lower fs
> - upper fs is traversed by name only for pure upper dirs
> - merged dirs are found by index of lower dir inode
>
> The result will be behavior 1 that you wanted
>
> I have some other use cases that might benefit from this mode.
>

That will be very cool!

> > >
> > > If that is all then it sounds pretty simple.
> > > It could be described something like this:
> > > 1. merged directories cannot appear/disappear
> > > 2. lower pure directories can appear/disappear
> > > 3. upper files/dirs can be created inside merge dirs and pure upper dirs
> > >
> > > I think I have some patches that could help with #2.
> > >
> >
> > These three semantics looks good to me.
> >
>
> Except the case about disappearing B that you just described
> above breaks rule #1 ;)
> so other semantics are needed.
>

Yes :) Now I understand. Thanks for the explanation on "merged" dir.

> I will need some more clarifications about your use case to
> understand if what I have in mind could help your use case.
>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-14 18:00                                         ` Hao Luo
@ 2022-09-14 19:23                                           ` Amir Goldstein
  2022-09-14 19:33                                             ` Hao Luo
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2022-09-14 19:23 UTC (permalink / raw)
  To: Hao Luo; +Cc: Miklos Szeredi, linux-fsdevel, overlayfs, Christian Brauner

On Wed, Sep 14, 2022 at 9:00 PM Hao Luo <haoluo@google.com> wrote:
>
> On Tue, Sep 13, 2022 at 8:46 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Sep 13, 2022 at 11:33 PM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Tue, Sep 13, 2022 at 11:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > OK. IIUC, you have upper fs files only in the root dir?
> > >
> > > Sorry, no, the upper fs files need to be in subdir.
> > >
> > > > And the lower root dir has only subdirs?
> > >
> > > There could be files.
> > >
> >
> > And assuming that those files are cgroupfs files, why
> > did you say there is no need to write to those files?
> >
> > I seem to recall that was an important distinction from
> > standard overlayfs when you described the problem in LSFMM.
>
> In my last reply, I was assuming all the writes to the cgroupfs files
> happen from remote, not from the union. One can read files from union,
> or create files that exist in the upper.
>
> The idea is, I can provide two copies of lower to users. One is the
> original lower, writable, so any update happens there. And the other
> is a union of the lower and the upper, it's a read-only view of the
> lower, but extended by the upper.
>
> I actually don't know whether supporting writes to the lower from the
> union is better. It probably is, because then I can combine the two
> copies into one.
>

Understood.

> >
> >
> > > > Can you give a small example of an upper a lower and their
> > > > union trees just for the sake of discussion?
> > > >
> > >
> > > For example, assume lower has the following layout:
> > > $ tree lower
> > > .
> > > └── A
> > >     ├── B
> > >     │   └── lower
> > >     └── lower
> > >
> > > I can't create files in the fs in the lower.
> > > $ touch A/B/file
> > > touch: cannot touch 'A/B/file': Permission denied
> > >
> > > The upper is initially empty.
> > >
> > > I would like to overlay a writable fs on top of lower, so the union
> > > tree looks like
> > > $ tree union
> > > .
> > > └── A
> > >     ├── B
> > >     │   └── lower
> > >     └── lower
> > > $ touch A/B/file
> > > $ tree union
> > > .
> > > └── A
> > >     ├── B
> > >     │   ├── file
> > >     │   └── lower2
> > >     └── lower1
> > >
> > > Here, 'file' exists in the upper.
> > >
> >
> > So B is now called a "merged" dir - it is not a "pure" dir
> > anymore because it contains both upper and lower files.
> >
> > Normally in overlayfs before creating 'file' in upper,
> > the hierarchy A/B/ needs to be created in upper fs
> > to contain the file.
> >
> > Unless your upper fs automagically has the same
> > dirs hierarchy as the lower fs?
> >
> > You should know that overlayfs does more than just
> > mkdir("A");mkdir("A/B")
> > it created tmp dirs, sets xattrs and attrs on them and moves
> > them into place.
> > I am not sure if you planned to support all those operations
> > in your upper fs?
> >
>
> Yeah. I can add support for tmp dirs, tmp files in my upper fs, that
> is, bpffs. I played it a bit back in May, that is totally doable. I
> remembered I successfully made bpffs accepted as overlayfs's upper
> without xattrs and attrs back then. Maybe I missed something.
>

sounds reasonable.
xattr is not a hard requirement but some things (that you do not need)
cannot work without xattr like merged directory rename and rmdir.

> > There are probably some other limitations at the moment
> > related to pseudo filesystems that prevent them from being
> > used as upper and/or lower fs in overlayfs.
> >
> > We will need to check what those limitations are and whether
> > those limitations could be lifted for your specific use case.
> >
>
> How can we approach this? Maybe I can send my patch that adds tmp dir,
> tmp files and xattr, attr to upstream as RFC, so you can take a look?
>

I don't think I need your fs to test.
The only thing special in this setup as far as I can tell is the dynamic
cgroupfs (or cgroup2?) lower dirs.

IIUC, everything worked for you except for oddities related to
lower directories not appearing and not disappearing from the union.
Is that correct? is that the only thing that you need a fix for?

> > > Further, directory B could disappear from lower. When that happens, I
> > > think there are two possible behaviors:
> > >  - make 'file' disappear from union as well;
> > >  - make 'file' and its directory accessible as well.
> > >
> > > In behavior 1, it will look like
> > > $ tree union
> > > .
> > > └── A
> > >     └── lower1
> > >
> > > In behavior 2, it will look like
> > > $ tree union
> > > .
> > > └── A
> > >     ├── B
> > >     │   └── file
> > >     └── lower1
> > >
> > > IMHO, behavior 1 works better in my use case. But if the FS experts
> > > think behavior 2 makes more sense, I can work around.
> > >
> >
> > Something that I always wanted to try is to get rid of the duplicated
> > upper fs hierarchy.
> >
> > It's a bit complicated to explain the details, but if your use case
> > does not involve any directory renames(?), then the upper path
> > for the merge directories can be index based and not hierarchical.
> >
>
> Yeah, I don't expect directory renaming. But I can't say if there is
> anyone trying to do that by accident, or by bad intention.
>

Your fs will return an error for rename if you did not implement it.

Anyway, if you can accept behavior 2, it is much more simple.
This other idea is very vague and not simple, so better not risk it.

If you confirm that you only need to get uptodate view of
lower dirs in union, then I will look for the patches that I have
and see if they can help you.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-14 19:23                                           ` Amir Goldstein
@ 2022-09-14 19:33                                             ` Hao Luo
  2022-09-15 10:54                                               ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Hao Luo @ 2022-09-14 19:33 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, linux-fsdevel, overlayfs, Christian Brauner

On Wed, Sep 14, 2022 at 12:23 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Sep 14, 2022 at 9:00 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Tue, Sep 13, 2022 at 8:46 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Tue, Sep 13, 2022 at 11:33 PM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > On Tue, Sep 13, 2022 at 11:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
[...]
> > > There are probably some other limitations at the moment
> > > related to pseudo filesystems that prevent them from being
> > > used as upper and/or lower fs in overlayfs.
> > >
> > > We will need to check what those limitations are and whether
> > > those limitations could be lifted for your specific use case.
> > >
> >
> > How can we approach this? Maybe I can send my patch that adds tmp dir,
> > tmp files and xattr, attr to upstream as RFC, so you can take a look?
> >
>
> I don't think I need your fs to test.
> The only thing special in this setup as far as I can tell is the dynamic
> cgroupfs (or cgroup2?) lower dirs.
>
> IIUC, everything worked for you except for oddities related to
> lower directories not appearing and not disappearing from the union.
> Is that correct? is that the only thing that you need a fix for?
>

Yes, that's correct.

> > > > Further, directory B could disappear from lower. When that happens, I
> > > > think there are two possible behaviors:
> > > >  - make 'file' disappear from union as well;
> > > >  - make 'file' and its directory accessible as well.
> > > >
> > > > In behavior 1, it will look like
> > > > $ tree union
> > > > .
> > > > └── A
> > > >     └── lower1
> > > >
> > > > In behavior 2, it will look like
> > > > $ tree union
> > > > .
> > > > └── A
> > > >     ├── B
> > > >     │   └── file
> > > >     └── lower1
> > > >
> > > > IMHO, behavior 1 works better in my use case. But if the FS experts
> > > > think behavior 2 makes more sense, I can work around.
> > > >
> > >
> > > Something that I always wanted to try is to get rid of the duplicated
> > > upper fs hierarchy.
> > >
> > > It's a bit complicated to explain the details, but if your use case
> > > does not involve any directory renames(?), then the upper path
> > > for the merge directories can be index based and not hierarchical.
> > >
> >
> > Yeah, I don't expect directory renaming. But I can't say if there is
> > anyone trying to do that by accident, or by bad intention.
> >
>
> Your fs will return an error for rename if you did not implement it.
>
> Anyway, if you can accept behavior 2, it is much more simple.
> This other idea is very vague and not simple, so better not risk it.
>
> If you confirm that you only need to get uptodate view of
> lower dirs in union, then I will look for the patches that I have
> and see if they can help you.
>

Yes, I acknowledge that behavior 2 works for me.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Overlayfs with writable lower layer
  2022-09-14 19:33                                             ` Hao Luo
@ 2022-09-15 10:54                                               ` Amir Goldstein
  0 siblings, 0 replies; 66+ messages in thread
From: Amir Goldstein @ 2022-09-15 10:54 UTC (permalink / raw)
  To: Hao Luo
  Cc: Miklos Szeredi, linux-fsdevel, overlayfs, Christian Brauner, Greg KH

On Wed, Sep 14, 2022 at 10:33 PM Hao Luo <haoluo@google.com> wrote:
>
> On Wed, Sep 14, 2022 at 12:23 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Wed, Sep 14, 2022 at 9:00 PM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Tue, Sep 13, 2022 at 8:46 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 13, 2022 at 11:33 PM Hao Luo <haoluo@google.com> wrote:
> > > > >
> > > > > On Tue, Sep 13, 2022 at 11:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
> [...]
> > > > There are probably some other limitations at the moment
> > > > related to pseudo filesystems that prevent them from being
> > > > used as upper and/or lower fs in overlayfs.
> > > >
> > > > We will need to check what those limitations are and whether
> > > > those limitations could be lifted for your specific use case.
> > > >
> > >
> > > How can we approach this? Maybe I can send my patch that adds tmp dir,
> > > tmp files and xattr, attr to upstream as RFC, so you can take a look?
> > >
> >
> > I don't think I need your fs to test.
> > The only thing special in this setup as far as I can tell is the dynamic
> > cgroupfs (or cgroup2?) lower dirs.
> >
> > IIUC, everything worked for you except for oddities related to
> > lower directories not appearing and not disappearing from the union.
> > Is that correct? is that the only thing that you need a fix for?
> >
>
> Yes, that's correct.
>
> > > > > Further, directory B could disappear from lower. When that happens, I
> > > > > think there are two possible behaviors:
> > > > >  - make 'file' disappear from union as well;
> > > > >  - make 'file' and its directory accessible as well.
> > > > >
> > > > > In behavior 1, it will look like
> > > > > $ tree union
> > > > > .
> > > > > └── A
> > > > >     └── lower1
> > > > >
> > > > > In behavior 2, it will look like
> > > > > $ tree union
> > > > > .
> > > > > └── A
> > > > >     ├── B
> > > > >     │   └── file
> > > > >     └── lower1
> > > > >
> > > > > IMHO, behavior 1 works better in my use case. But if the FS experts
> > > > > think behavior 2 makes more sense, I can work around.
> > > > >
> > > >
> > > > Something that I always wanted to try is to get rid of the duplicated
> > > > upper fs hierarchy.
> > > >
> > > > It's a bit complicated to explain the details, but if your use case
> > > > does not involve any directory renames(?), then the upper path
> > > > for the merge directories can be index based and not hierarchical.
> > > >
> > >
> > > Yeah, I don't expect directory renaming. But I can't say if there is
> > > anyone trying to do that by accident, or by bad intention.
> > >
> >
> > Your fs will return an error for rename if you did not implement it.
> >
> > Anyway, if you can accept behavior 2, it is much more simple.
> > This other idea is very vague and not simple, so better not risk it.
> >
> > If you confirm that you only need to get uptodate view of
> > lower dirs in union, then I will look for the patches that I have
> > and see if they can help you.
> >
>
> Yes, I acknowledge that behavior 2 works for me.

OK. I took a closer look and there are some challenges.
Nothing that cannot be fixed if you are willing to do the work.
I will try to explain the challenges and possible solutions.

Current overlayfs code assumes in many places that the
lower fs is not being changed at all while overlayfs is mounted.
As overlayfs.rst says:

"Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed.  If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock."

One of the most visible impacts of changes to lower later
is that the merge dir cache is not invalidated, which is the
immediate reason that you are seeing the ghost lower dir A/B
in the union even if you did not create file A/B/file.

You can check if this hack fixes your first order problem:

diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 78f62cc1797b..4eb6fcf341de 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -326,7 +326,7 @@ static void ovl_dir_reset(struct file *file)
        struct dentry *dentry = file->f_path.dentry;
        bool is_real;

-       if (cache && ovl_dentry_version_get(dentry) != cache->version) {
+       if (cache /*&& ovl_dentry_version_get(dentry) != cache->version*/) {
                ovl_cache_put(od, dentry);
                od->cache = NULL;
                od->cursor = NULL;
---

If it does, it may be acceptable to add that behavior as a mount option.

But it doesn't end here, there is also lookup cache and possibly other
issues as well related to merge dirs with ghosted lower.

If you did create file A/B/file then trying to list A/B after A/B
has gone from lower fs,  may depend on the lower fs behavior.
Some of the issues are not related to overlayfs but to cgroupfs.

For "standard" Linux fs, if you keep an open fd to a directory,
that directory can be removed and then if you try to readdir from
the open fd, or use the fd in one of the XXXat() syscalls,
you will get ENOENT, because of the IS_DEADDIR(dir) checks
in the vfs.

Do you get this behavior with an open fd on a cgroupfs dir
that has disappeared? Please check.

I think that ovl_iterate() can be made more tolerant to
ENOENT when iterating a merge dir with ghosted lower dir.
If you run into this error when trying to list A/B, find out
the place in the code that returns the error and I'll see
if that error may be relaxed.

The patches that I have are doing something different.
The idea is that overlayfs can watch for lower fs changes using
fsnotify() callbacks and do "something" proactive when they happen.

My Overlayfs watch [1] patches do "something" else - they
record the changes to lower fs when they happen, but they
demonstrate the basic concept of watching changes in lower fs.

[1] https://github.com/amir73il/overlayfs/wiki/Overlayfs-watch

The "something" that overlayfs could do when a lower dir
is removed is to invalidate the caches of the union dir and
everything under it.

There is one other small problem with this method w.r.t
lower cgroupfs - cgroupfs does not call any fsnotify callbacks when
directories disappear...

cgroupfs is an instance of kernfs.
kenfs is calling the fsnotify_modify() hook when kernel changes
the content of a file:

d911d9874801 kernfs: make kernfs_notify() trigger inotify events too

but it does not call fsnotify_rmdir/mkdir/delete/create() like other pseudo
fs do (debugfs, configfs, tracefs, ...) when directories appear/disappear -
at least I don't think that it does.

Please run inotifywatch on cgroupfs and find out for yourself.

Hope that some of the info here can help you move forward.
Most of it you can probably ignore.

Thanks,
Amir.

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2022-09-12 12:29                   ` Amir Goldstein
  2022-09-12 13:03                     ` Miklos Szeredi
@ 2023-05-12 19:37                     ` Amir Goldstein
  2023-05-15  7:29                       ` Miklos Szeredi
  1 sibling, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2023-05-12 19:37 UTC (permalink / raw)
  To: Miklos Szeredi, Daniel Borkmann
  Cc: Alessio Balsini, Peng Tao, Akilesh Kailash, Antonio SJ Musumeci,
	David Anderson, Giuseppe Scrivano, Jann Horn, Jens Axboe,
	Martijn Coenen, Palmer Dabbelt, Paul Lawrence, Stefano Duo,
	Zimuzo Ezeozue, wuyan, fuse-devel, kernel-team, linux-fsdevel

On Mon, Sep 12, 2022 at 8:29 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Sep 12, 2022 at 12:29 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Sat, 10 Sept 2022 at 10:52, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > I think we should accept the fact that just as any current FUSE
> > > passthrough (in userspace) implementation is limited to max number of
> > > open files as the server's process limitation, kernel passthrough implementation
> > > will be limited by inheriting the mounter's process limitation.
> > >
> > > There is no reason that the server should need to keep more
> > > passthrough fd's open than client open fds.
> >
> > Maybe you're right.
> >
> > > If we only support FOPEN_PASSTHROUGH_AUTOCLOSE as v12
> > > patches implicitly do, then the memory overhead is not much different
> > > than the extra overlayfs pseudo realfiles.
> >
> > How exactly would this work?
> >
> > ioctl(F_D_I_P_OPEN) - create passthrough fd with ref 1
> > open/FOPEN_PASSTHOUGH -  inc refcount in passthrough fd
> > release - put refcount in passthrough fd
> > ioctl(F_D_I_P_CLOSE) - put ref in passthrough fd
> >
> > Due to being refcounted the F_D_I_P_CLOSE can come at any point past
> > the finished open request.
> >
> > Or did you have something else in mind?
> >
>
> What I had in mind is that FOPEN_PASSTHROUGH_AUTOCLOSE
> "transfers" the server's refcount to the kernel and server does
> not need to call explicit F_D_I_P_CLOSE.
>
> This is useful for servers that don't care about reusing mappings.
>

Hi Daniel,

I was waiting for LSFMM to see if and how FUSE-BPF intends to
address the highest value use case of read/write passthrough.

From what I've seen, you are still taking a very broad approach of
all-or-nothing which still has a lot of core design issues to address,
while these old patches already address the most important use case
of read/write passthrough of fd without any of the core issues
(credentials, hidden fds).

As far as I can tell, this old implementation is mostly independent of your
lookup based approach - they share the low level read/write passthrough
functions but not much more than that, so merging them should not be
a blocker to your efforts in the longer run.
Please correct me if I am wrong.

As things stand, I intend to re-post these old patches with mandatory
FOPEN_PASSTHROUGH_AUTOCLOSE to eliminate the open
questions about managing mappings.

Miklos, please stop me if I missed something and if you do not
think that these two approaches are independent.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2023-05-12 19:37                     ` [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough Amir Goldstein
@ 2023-05-15  7:29                       ` Miklos Szeredi
  2023-05-15 14:00                         ` Amir Goldstein
  0 siblings, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2023-05-15  7:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Daniel Borkmann, Alessio Balsini, Peng Tao, Akilesh Kailash,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel

On Fri, 12 May 2023 at 21:37, Amir Goldstein <amir73il@gmail.com> wrote:

> I was waiting for LSFMM to see if and how FUSE-BPF intends to
> address the highest value use case of read/write passthrough.
>
> From what I've seen, you are still taking a very broad approach of
> all-or-nothing which still has a lot of core design issues to address,
> while these old patches already address the most important use case
> of read/write passthrough of fd without any of the core issues
> (credentials, hidden fds).
>
> As far as I can tell, this old implementation is mostly independent of your
> lookup based approach - they share the low level read/write passthrough
> functions but not much more than that, so merging them should not be
> a blocker to your efforts in the longer run.
> Please correct me if I am wrong.
>
> As things stand, I intend to re-post these old patches with mandatory
> FOPEN_PASSTHROUGH_AUTOCLOSE to eliminate the open
> questions about managing mappings.
>
> Miklos, please stop me if I missed something and if you do not
> think that these two approaches are independent.

Do you mean that the BPF patches should use their own passthrough mechanism?

I think it would be better if we could agree on a common interface for
passthough (or per Paul's suggestion: backing) mechanism.

Let's see this patchset and then we can discuss how this could be
usable for the BPF case as well.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2023-05-15  7:29                       ` Miklos Szeredi
@ 2023-05-15 14:00                         ` Amir Goldstein
  2023-05-15 20:16                           ` [fuse-devel] " Nikolaus Rath
  0 siblings, 1 reply; 66+ messages in thread
From: Amir Goldstein @ 2023-05-15 14:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Daniel Borkmann, Alessio Balsini, Peng Tao, Akilesh Kailash,
	Antonio SJ Musumeci, David Anderson, Giuseppe Scrivano,
	Jann Horn, Jens Axboe, Martijn Coenen, Palmer Dabbelt,
	Paul Lawrence, Stefano Duo, Zimuzo Ezeozue, wuyan, fuse-devel,
	kernel-team, linux-fsdevel

On Mon, May 15, 2023 at 10:29 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, 12 May 2023 at 21:37, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I was waiting for LSFMM to see if and how FUSE-BPF intends to
> > address the highest value use case of read/write passthrough.
> >
> > From what I've seen, you are still taking a very broad approach of
> > all-or-nothing which still has a lot of core design issues to address,
> > while these old patches already address the most important use case
> > of read/write passthrough of fd without any of the core issues
> > (credentials, hidden fds).
> >
> > As far as I can tell, this old implementation is mostly independent of your
> > lookup based approach - they share the low level read/write passthrough
> > functions but not much more than that, so merging them should not be
> > a blocker to your efforts in the longer run.
> > Please correct me if I am wrong.
> >
> > As things stand, I intend to re-post these old patches with mandatory
> > FOPEN_PASSTHROUGH_AUTOCLOSE to eliminate the open
> > questions about managing mappings.
> >
> > Miklos, please stop me if I missed something and if you do not
> > think that these two approaches are independent.
>
> Do you mean that the BPF patches should use their own passthrough mechanism?
>
> I think it would be better if we could agree on a common interface for
> passthough (or per Paul's suggestion: backing) mechanism.

Well, not exactly different.
With BFP patches, if you have a backing inode that was established during
LOOKUP with rules to do passthrough for open(), you'd get a backing file and
that backing file would be used to passthrough read/write.

FOPEN_PASSTHROUGH is another way to configure passthrough read/write
to a backing file that is controlled by the server per open fd instead of by BFP
for every open of the backing inode.

Obviously, both methods would use the same backing_file field and
same read/write passthrough methods regardless of how the backing file
was setup.

Obviously, the BFP patches will not use the same ioctl to setup passthrough
(and/or BPF program) to a backing inode, but I don't think that matters much.
When we settle on ioctls for setting up backing inodes, we can also add new
ioctls for setting up backing file with optional BPF program.
I don't see any reason to make the first ioctl more complicated than this:

struct fuse_passthrough_out {
        uint32_t        fd;
        /* For future implementation */
        uint32_t        len;
        void            *vec;
};

One advantage with starting with FOPEN_PASSTHROUGH, besides
dealing with the highest priority performance issue, is how it deals with
resource limits on open files.

While the backing files are not accounted to the server, the server
is very likely to keep an open fd for the backing file until release,
otherwise, the server will not be able to perform other non-passthrough
file operations (e.g. fallocate) on the backing fd, so at least with
FOPEN_PASSTHROUGH_AUTOCLOSE, there should be only
up to 2 times the number of open files, very much the same as overlayfs.

I intend to enforce this heuristically by counting the number of passthrough
fds and restrict new passthrough fd setup to the number of current open fds
by the server, so a malicious or misbehaving server cannot setup infinite
number of backing fds that it does not also keep open itself.

>
> Let's see this patchset and then we can discuss how this could be
> usable for the BPF case as well.
>

OK. I'll try to dust off these patches and re-submit.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [fuse-devel] [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2023-05-15 14:00                         ` Amir Goldstein
@ 2023-05-15 20:16                           ` Nikolaus Rath
  2023-05-15 21:11                             ` Bernd Schubert
  0 siblings, 1 reply; 66+ messages in thread
From: Nikolaus Rath @ 2023-05-15 20:16 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Jens Axboe, Giuseppe Scrivano, fuse-devel,
	Daniel Borkmann, Jann Horn, Zimuzo Ezeozue, kernel-team,
	Peng Tao, Palmer Dabbelt, Stefano Duo, David Anderson, wuyan,
	linux-fsdevel, Akilesh Kailash, Martijn Coenen

On May 15 2023, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, May 15, 2023 at 10:29 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Fri, 12 May 2023 at 21:37, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>> > I was waiting for LSFMM to see if and how FUSE-BPF intends to
>> > address the highest value use case of read/write passthrough.
>> >
>> > From what I've seen, you are still taking a very broad approach of
>> > all-or-nothing which still has a lot of core design issues to address,
>> > while these old patches already address the most important use case
>> > of read/write passthrough of fd without any of the core issues
>> > (credentials, hidden fds).
>> >
>> > As far as I can tell, this old implementation is mostly independent of your
>> > lookup based approach - they share the low level read/write passthrough
>> > functions but not much more than that, so merging them should not be
>> > a blocker to your efforts in the longer run.
>> > Please correct me if I am wrong.
>> >
>> > As things stand, I intend to re-post these old patches with mandatory
>> > FOPEN_PASSTHROUGH_AUTOCLOSE to eliminate the open
>> > questions about managing mappings.
>> >
>> > Miklos, please stop me if I missed something and if you do not
>> > think that these two approaches are independent.
>>
>> Do you mean that the BPF patches should use their own passthrough mechanism?
>>
>> I think it would be better if we could agree on a common interface for
>> passthough (or per Paul's suggestion: backing) mechanism.
>
> Well, not exactly different.
> With BFP patches, if you have a backing inode that was established during
> LOOKUP with rules to do passthrough for open(), you'd get a backing file and
> that backing file would be used to passthrough read/write.
>
> FOPEN_PASSTHROUGH is another way to configure passthrough read/write
> to a backing file that is controlled by the server per open fd instead of by BFP
> for every open of the backing inode.
>
> Obviously, both methods would use the same backing_file field and
> same read/write passthrough methods regardless of how the backing file
> was setup.
>
> Obviously, the BFP patches will not use the same ioctl to setup passthrough
> (and/or BPF program) to a backing inode, but I don't think that matters much.
> When we settle on ioctls for setting up backing inodes, we can also add new
> ioctls for setting up backing file with optional BPF program.

> I don't see any reason to make the first ioctl more complicated than this:
>
> struct fuse_passthrough_out {
>         uint32_t        fd;
>         /* For future implementation */
>         uint32_t        len;
>         void            *vec;
> };
>
> One advantage with starting with FOPEN_PASSTHROUGH, besides
> dealing with the highest priority performance issue, is how it deals with
> resource limits on open files.

One thing that struck me when we discussed FUSE-BPF at LSF was that from
a userspace point of view, FUSE-BPF presents an almost completely
different API than traditional FUSE (at least in its current form).

As long as there is no support for falling back to standard FUSE
callbacks, using FUSE-BPF means that most of the existing API no longer
works, and instead there is a large new API surface that doesn't work in
standard FUSE (the pre-filter and post-filter callbacks for each
operation).

I think this means that FUSE-BPF file systems won't work with FUSE, and
FUSE filesystems won't work with FUSE-BPF.

Would it be worth thinking about FUSE-BPF as a completely separate
approach that stands next to FUSE, as opposed to considering it an
extension?

In that case, we wouldn't need to worry about a FUSE-passthrough
implementation being forward compatible with FUSE-BPF or not.



Best,
-Nikolaus


-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [fuse-devel] [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2023-05-15 20:16                           ` [fuse-devel] " Nikolaus Rath
@ 2023-05-15 21:11                             ` Bernd Schubert
  2023-05-15 21:45                               ` Paul Lawrence
  0 siblings, 1 reply; 66+ messages in thread
From: Bernd Schubert @ 2023-05-15 21:11 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi, Jens Axboe, Giuseppe Scrivano,
	fuse-devel, Daniel Borkmann, Jann Horn, Zimuzo Ezeozue,
	kernel-team, Peng Tao, Palmer Dabbelt, Stefano Duo,
	David Anderson, wuyan, linux-fsdevel, Akilesh Kailash,
	Martijn Coenen



On 5/15/23 22:16, Nikolaus Rath wrote:
> On May 15 2023, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Mon, May 15, 2023 at 10:29 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>>> On Fri, 12 May 2023 at 21:37, Amir Goldstein <amir73il@gmail.com> wrote:
>>>
>>>> I was waiting for LSFMM to see if and how FUSE-BPF intends to
>>>> address the highest value use case of read/write passthrough.
>>>>
>>>>  From what I've seen, you are still taking a very broad approach of
>>>> all-or-nothing which still has a lot of core design issues to address,
>>>> while these old patches already address the most important use case
>>>> of read/write passthrough of fd without any of the core issues
>>>> (credentials, hidden fds).
>>>>
>>>> As far as I can tell, this old implementation is mostly independent of your
>>>> lookup based approach - they share the low level read/write passthrough
>>>> functions but not much more than that, so merging them should not be
>>>> a blocker to your efforts in the longer run.
>>>> Please correct me if I am wrong.
>>>>
>>>> As things stand, I intend to re-post these old patches with mandatory
>>>> FOPEN_PASSTHROUGH_AUTOCLOSE to eliminate the open
>>>> questions about managing mappings.
>>>>
>>>> Miklos, please stop me if I missed something and if you do not
>>>> think that these two approaches are independent.
>>>
>>> Do you mean that the BPF patches should use their own passthrough mechanism?
>>>
>>> I think it would be better if we could agree on a common interface for
>>> passthough (or per Paul's suggestion: backing) mechanism.
>>
>> Well, not exactly different.
>> With BFP patches, if you have a backing inode that was established during
>> LOOKUP with rules to do passthrough for open(), you'd get a backing file and
>> that backing file would be used to passthrough read/write.
>>
>> FOPEN_PASSTHROUGH is another way to configure passthrough read/write
>> to a backing file that is controlled by the server per open fd instead of by BFP
>> for every open of the backing inode.
>>
>> Obviously, both methods would use the same backing_file field and
>> same read/write passthrough methods regardless of how the backing file
>> was setup.
>>
>> Obviously, the BFP patches will not use the same ioctl to setup passthrough
>> (and/or BPF program) to a backing inode, but I don't think that matters much.
>> When we settle on ioctls for setting up backing inodes, we can also add new
>> ioctls for setting up backing file with optional BPF program.
> 
>> I don't see any reason to make the first ioctl more complicated than this:
>>
>> struct fuse_passthrough_out {
>>          uint32_t        fd;
>>          /* For future implementation */
>>          uint32_t        len;
>>          void            *vec;
>> };
>>
>> One advantage with starting with FOPEN_PASSTHROUGH, besides
>> dealing with the highest priority performance issue, is how it deals with
>> resource limits on open files.
> 
> One thing that struck me when we discussed FUSE-BPF at LSF was that from
> a userspace point of view, FUSE-BPF presents an almost completely
> different API than traditional FUSE (at least in its current form).
> 
> As long as there is no support for falling back to standard FUSE
> callbacks, using FUSE-BPF means that most of the existing API no longer
> works, and instead there is a large new API surface that doesn't work in
> standard FUSE (the pre-filter and post-filter callbacks for each
> operation).
> 
> I think this means that FUSE-BPF file systems won't work with FUSE, and
> FUSE filesystems won't work with FUSE-BPF.

Is that so? I think found some incompatibilities in the patches (need to 
double check), but doesn't it just do normal fuse operations and then 
replies with an ioctl to do passthrough? BPF is used for additional 
filtering, that would have to be done otherwise in userspace.

Really difficult in the current patch set and data structures is to see 
what is actually BPF and what is passthrough.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [fuse-devel] [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2023-05-15 21:11                             ` Bernd Schubert
@ 2023-05-15 21:45                               ` Paul Lawrence
  2023-05-16  8:43                                 ` Miklos Szeredi
  2023-05-16  8:48                                 ` Amir Goldstein
  0 siblings, 2 replies; 66+ messages in thread
From: Paul Lawrence @ 2023-05-15 21:45 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, Miklos Szeredi, Jens Axboe, Giuseppe Scrivano,
	fuse-devel, Daniel Borkmann, Jann Horn, Zimuzo Ezeozue,
	kernel-team, Peng Tao, Palmer Dabbelt, Stefano Duo,
	David Anderson, wuyan, linux-fsdevel, Akilesh Kailash,
	Martijn Coenen

On Mon, May 15, 2023 at 2:11 PM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
> On 5/15/23 22:16, Nikolaus Rath wrote:
> > On May 15 2023, Amir Goldstein <amir73il@gmail.com> wrote:
> >> On Mon, May 15, 2023 at 10:29 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >>> On Fri, 12 May 2023 at 21:37, Amir Goldstein <amir73il@gmail.com> wrote:
> >>>
> >>>> I was waiting for LSFMM to see if and how FUSE-BPF intends to
> >>>> address the highest value use case of read/write passthrough.
> >>>>
> >>>>  From what I've seen, you are still taking a very broad approach of
> >>>> all-or-nothing which still has a lot of core design issues to address,
> >>>> while these old patches already address the most important use case
> >>>> of read/write passthrough of fd without any of the core issues
> >>>> (credentials, hidden fds).
> >>>>
> >>>> As far as I can tell, this old implementation is mostly independent of your
> >>>> lookup based approach - they share the low level read/write passthrough
> >>>> functions but not much more than that, so merging them should not be
> >>>> a blocker to your efforts in the longer run.
> >>>> Please correct me if I am wrong.
> >>>>
> >>>> As things stand, I intend to re-post these old patches with mandatory
> >>>> FOPEN_PASSTHROUGH_AUTOCLOSE to eliminate the open
> >>>> questions about managing mappings.
> >>>>
> >>>> Miklos, please stop me if I missed something and if you do not
> >>>> think that these two approaches are independent.
> >>>
> >>> Do you mean that the BPF patches should use their own passthrough mechanism?
> >>>
> >>> I think it would be better if we could agree on a common interface for
> >>> passthough (or per Paul's suggestion: backing) mechanism.
> >>
> >> Well, not exactly different.
> >> With BFP patches, if you have a backing inode that was established during
> >> LOOKUP with rules to do passthrough for open(), you'd get a backing file and
> >> that backing file would be used to passthrough read/write.
> >>
> >> FOPEN_PASSTHROUGH is another way to configure passthrough read/write
> >> to a backing file that is controlled by the server per open fd instead of by BFP
> >> for every open of the backing inode.
> >>
> >> Obviously, both methods would use the same backing_file field and
> >> same read/write passthrough methods regardless of how the backing file
> >> was setup.
> >>
> >> Obviously, the BFP patches will not use the same ioctl to setup passthrough
> >> (and/or BPF program) to a backing inode, but I don't think that matters much.
> >> When we settle on ioctls for setting up backing inodes, we can also add new
> >> ioctls for setting up backing file with optional BPF program.
> >
> >> I don't see any reason to make the first ioctl more complicated than this:
> >>
> >> struct fuse_passthrough_out {
> >>          uint32_t        fd;
> >>          /* For future implementation */
> >>          uint32_t        len;
> >>          void            *vec;
> >> };
> >>
> >> One advantage with starting with FOPEN_PASSTHROUGH, besides
> >> dealing with the highest priority performance issue, is how it deals with
> >> resource limits on open files.
> >
> > One thing that struck me when we discussed FUSE-BPF at LSF was that from
> > a userspace point of view, FUSE-BPF presents an almost completely
> > different API than traditional FUSE (at least in its current form).
> >
> > As long as there is no support for falling back to standard FUSE
> > callbacks, using FUSE-BPF means that most of the existing API no longer
> > works, and instead there is a large new API surface that doesn't work in
> > standard FUSE (the pre-filter and post-filter callbacks for each
> > operation).
> >
> > I think this means that FUSE-BPF file systems won't work with FUSE, and
> > FUSE filesystems won't work with FUSE-BPF.
>
> Is that so? I think found some incompatibilities in the patches (need to
> double check), but doesn't it just do normal fuse operations and then
> replies with an ioctl to do passthrough? BPF is used for additional
> filtering, that would have to be done otherwise in userspace.
>
> Really difficult in the current patch set and data structures is to see
> what is actually BPF and what is passthrough.

I hope that fuse and fuse-bpf play together a little better than that
;) In the current design, you can set a backing file from within
traditional fuse lookups, which moves you to fuse-bpf for that
file/directory, and you can remove the backing file during the
post-filter, moving that node back to fuse. You can also return a
value from the bpf prefilter that tells fuse to use traditional fuse
for that command. I think this is a very useful feature - it's one of
the first ones we used in Android.

If we do find any areas where we can't easily switch between
traditional fuse and fuse-bpf, we would consider that a bug and fix it
as fast as possible.

And yes, we got the feedback from LSFMMBPF that the current patches
are hard to follow, and we will be reordering them and resending them
as three patchsets. One will add backing files, one will add backing
directories, and the final will add bpf filters to both. Hopefully
that will make them easier to understand.

Paul

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [fuse-devel] [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2023-05-15 21:45                               ` Paul Lawrence
@ 2023-05-16  8:43                                 ` Miklos Szeredi
  2023-05-16 10:16                                   ` Nikolaus Rath
  2023-05-16  8:48                                 ` Amir Goldstein
  1 sibling, 1 reply; 66+ messages in thread
From: Miklos Szeredi @ 2023-05-16  8:43 UTC (permalink / raw)
  To: Paul Lawrence
  Cc: Bernd Schubert, Amir Goldstein, Jens Axboe, Giuseppe Scrivano,
	fuse-devel, Daniel Borkmann, Jann Horn, Zimuzo Ezeozue,
	kernel-team, Peng Tao, Palmer Dabbelt, Stefano Duo,
	David Anderson, wuyan, linux-fsdevel, Akilesh Kailash,
	Martijn Coenen

On Mon, 15 May 2023 at 23:45, Paul Lawrence <paullawrence@google.com> wrote:
>
> On Mon, May 15, 2023 at 2:11 PM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
> > On 5/15/23 22:16, Nikolaus Rath wrote:

> > > One thing that struck me when we discussed FUSE-BPF at LSF was that from
> > > a userspace point of view, FUSE-BPF presents an almost completely
> > > different API than traditional FUSE (at least in its current form).
> > >
> > > As long as there is no support for falling back to standard FUSE
> > > callbacks, using FUSE-BPF means that most of the existing API no longer
> > > works, and instead there is a large new API surface that doesn't work in
> > > standard FUSE (the pre-filter and post-filter callbacks for each
> > > operation).
> > >
> > > I think this means that FUSE-BPF file systems won't work with FUSE, and
> > > FUSE filesystems won't work with FUSE-BPF.
> >
> > Is that so? I think found some incompatibilities in the patches (need to
> > double check), but doesn't it just do normal fuse operations and then
> > replies with an ioctl to do passthrough? BPF is used for additional
> > filtering, that would have to be done otherwise in userspace.

I think Nikolaus' concern is that the BPF hooks add a major upgrade to
the API, i.e. it looks very difficult to port a BPF based fs to
non-BPF based fuse.  The new API should at least come with sufficient
warnings about portability issues.

I don't think the other direction has problems. The fuse API/ABI must
remain backward compatible and old filesystems must be able to work
after this feature is added.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [fuse-devel] [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2023-05-15 21:45                               ` Paul Lawrence
  2023-05-16  8:43                                 ` Miklos Szeredi
@ 2023-05-16  8:48                                 ` Amir Goldstein
  1 sibling, 0 replies; 66+ messages in thread
From: Amir Goldstein @ 2023-05-16  8:48 UTC (permalink / raw)
  To: Paul Lawrence
  Cc: Bernd Schubert, Miklos Szeredi, Jens Axboe, Giuseppe Scrivano,
	fuse-devel, Daniel Borkmann, Jann Horn, Zimuzo Ezeozue,
	kernel-team, Peng Tao, Palmer Dabbelt, Stefano Duo,
	David Anderson, wuyan, linux-fsdevel, Akilesh Kailash,
	Martijn Coenen

On Tue, May 16, 2023 at 12:45 AM Paul Lawrence <paullawrence@google.com> wrote:
>
> On Mon, May 15, 2023 at 2:11 PM Bernd Schubert
> <bernd.schubert@fastmail.fm> wrote:
> > On 5/15/23 22:16, Nikolaus Rath wrote:
> > > On May 15 2023, Amir Goldstein <amir73il@gmail.com> wrote:
> > >> On Mon, May 15, 2023 at 10:29 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >>> On Fri, 12 May 2023 at 21:37, Amir Goldstein <amir73il@gmail.com> wrote:
> > >>>
> > >>>> I was waiting for LSFMM to see if and how FUSE-BPF intends to
> > >>>> address the highest value use case of read/write passthrough.
> > >>>>
> > >>>>  From what I've seen, you are still taking a very broad approach of
> > >>>> all-or-nothing which still has a lot of core design issues to address,
> > >>>> while these old patches already address the most important use case
> > >>>> of read/write passthrough of fd without any of the core issues
> > >>>> (credentials, hidden fds).
> > >>>>
> > >>>> As far as I can tell, this old implementation is mostly independent of your
> > >>>> lookup based approach - they share the low level read/write passthrough
> > >>>> functions but not much more than that, so merging them should not be
> > >>>> a blocker to your efforts in the longer run.
> > >>>> Please correct me if I am wrong.
> > >>>>
> > >>>> As things stand, I intend to re-post these old patches with mandatory
> > >>>> FOPEN_PASSTHROUGH_AUTOCLOSE to eliminate the open
> > >>>> questions about managing mappings.
> > >>>>
> > >>>> Miklos, please stop me if I missed something and if you do not
> > >>>> think that these two approaches are independent.
> > >>>
> > >>> Do you mean that the BPF patches should use their own passthrough mechanism?
> > >>>
> > >>> I think it would be better if we could agree on a common interface for
> > >>> passthough (or per Paul's suggestion: backing) mechanism.
> > >>
> > >> Well, not exactly different.
> > >> With BFP patches, if you have a backing inode that was established during
> > >> LOOKUP with rules to do passthrough for open(), you'd get a backing file and
> > >> that backing file would be used to passthrough read/write.
> > >>
> > >> FOPEN_PASSTHROUGH is another way to configure passthrough read/write
> > >> to a backing file that is controlled by the server per open fd instead of by BFP
> > >> for every open of the backing inode.
> > >>
> > >> Obviously, both methods would use the same backing_file field and
> > >> same read/write passthrough methods regardless of how the backing file
> > >> was setup.
> > >>
> > >> Obviously, the BFP patches will not use the same ioctl to setup passthrough
> > >> (and/or BPF program) to a backing inode, but I don't think that matters much.
> > >> When we settle on ioctls for setting up backing inodes, we can also add new
> > >> ioctls for setting up backing file with optional BPF program.
> > >
> > >> I don't see any reason to make the first ioctl more complicated than this:
> > >>
> > >> struct fuse_passthrough_out {
> > >>          uint32_t        fd;
> > >>          /* For future implementation */
> > >>          uint32_t        len;
> > >>          void            *vec;
> > >> };
> > >>
> > >> One advantage with starting with FOPEN_PASSTHROUGH, besides
> > >> dealing with the highest priority performance issue, is how it deals with
> > >> resource limits on open files.
> > >
> > > One thing that struck me when we discussed FUSE-BPF at LSF was that from
> > > a userspace point of view, FUSE-BPF presents an almost completely
> > > different API than traditional FUSE (at least in its current form).
> > >
> > > As long as there is no support for falling back to standard FUSE
> > > callbacks, using FUSE-BPF means that most of the existing API no longer
> > > works, and instead there is a large new API surface that doesn't work in
> > > standard FUSE (the pre-filter and post-filter callbacks for each
> > > operation).

I think there is a confusion here that needs to be clarified.
I was confused when you asked in the session why the usermode
post-filter was needed.

IIUC, there is no usermode post filter. There are only in-kernel BPF
pre/post filters.

Paul/Daniel will correct me if I am wrong, but I think the FUSE server
can be called at most once per op as legacy FUSE, but with
FUSE-BPF, the server may be bypassed.

Pre/post filters are used to toggle the bypass mode permanently
or for a specific op and post filter can also be used to modify the
server response.

> > >
> > > I think this means that FUSE-BPF file systems won't work with FUSE, and
> > > FUSE filesystems won't work with FUSE-BPF.
> >
> > Is that so? I think found some incompatibilities in the patches (need to
> > double check), but doesn't it just do normal fuse operations and then
> > replies with an ioctl to do passthrough?

About that, I wanted to ask.
Alessio's initial patches used to have a similar approach.
Without ioctl, but the passthrough/backing fd was provided as part of the
response to OPEN request.

Following feedback from Miklos and Jens, not only the passthrough
request was moved to ioctl, but it was also decoupled from the OPEN
response.

This allows the server more flexibility in managing the passthrough
mode of files (or inodes in FUSE-BPF case).
FUSE-BPF patches use ioctl for response, but without decoupling.
I wonder if that should be amended for the next version?

> > BPF is used for additional
> > filtering, that would have to be done otherwise in userspace.
> >
> > Really difficult in the current patch set and data structures is to see
> > what is actually BPF and what is passthrough.
>
> I hope that fuse and fuse-bpf play together a little better than that
> ;) In the current design, you can set a backing file from within
> traditional fuse lookups, which moves you to fuse-bpf for that
> file/directory, and you can remove the backing file during the
> post-filter, moving that node back to fuse. You can also return a
> value from the bpf prefilter that tells fuse to use traditional fuse
> for that command. I think this is a very useful feature - it's one of
> the first ones we used in Android.
>
> If we do find any areas where we can't easily switch between
> traditional fuse and fuse-bpf, we would consider that a bug and fix it
> as fast as possible.
>
> And yes, we got the feedback from LSFMMBPF that the current patches
> are hard to follow, and we will be reordering them and resending them
> as three patchsets. One will add backing files, one will add backing
> directories, and the final will add bpf filters to both. Hopefully
> that will make them easier to understand.
>

That sounds great!
I started to dust off Alessio's patches.
I might just post what I have as a reference implementation that
we can compare to your "backing file" series.
I would much rather that your version is the one that ends up being
merged at the end ;-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [fuse-devel] [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough
  2023-05-16  8:43                                 ` Miklos Szeredi
@ 2023-05-16 10:16                                   ` Nikolaus Rath
  0 siblings, 0 replies; 66+ messages in thread
From: Nikolaus Rath @ 2023-05-16 10:16 UTC (permalink / raw)
  To: Miklos Szeredi via fuse-devel
  Cc: Paul Lawrence, Miklos Szeredi, Jens Axboe, Giuseppe Scrivano,
	Peng Tao, Daniel Borkmann, Jann Horn, Bernd Schubert,
	Martijn Coenen, Zimuzo Ezeozue, Palmer Dabbelt, Stefano Duo,
	David Anderson, Akilesh Kailash, linux-fsdevel, wuyan,
	kernel-team

On May 16 2023, Miklos Szeredi via fuse-devel <fuse-devel@lists.sourceforge.net> wrote:
> On Mon, 15 May 2023 at 23:45, Paul Lawrence <paullawrence@google.com> wrote:
>>
>> On Mon, May 15, 2023 at 2:11 PM Bernd Schubert
>> <bernd.schubert@fastmail.fm> wrote:
>> > On 5/15/23 22:16, Nikolaus Rath wrote:
>
>> > > One thing that struck me when we discussed FUSE-BPF at LSF was that from
>> > > a userspace point of view, FUSE-BPF presents an almost completely
>> > > different API than traditional FUSE (at least in its current form).
>> > >
>> > > As long as there is no support for falling back to standard FUSE
>> > > callbacks, using FUSE-BPF means that most of the existing API no longer
>> > > works, and instead there is a large new API surface that doesn't work in
>> > > standard FUSE (the pre-filter and post-filter callbacks for each
>> > > operation).
>> > >
>> > > I think this means that FUSE-BPF file systems won't work with FUSE, and
>> > > FUSE filesystems won't work with FUSE-BPF.
>> >
>> > Is that so? I think found some incompatibilities in the patches (need to
>> > double check), but doesn't it just do normal fuse operations and then
>> > replies with an ioctl to do passthrough? BPF is used for additional
>> > filtering, that would have to be done otherwise in userspace.
>
> I think Nikolaus' concern is that the BPF hooks add a major upgrade to
> the API, i.e. it looks very difficult to port a BPF based fs to
> non-BPF based fuse.  The new API should at least come with sufficient
> warnings about portability issues.
>
> I don't think the other direction has problems. The fuse API/ABI must
> remain backward compatible and old filesystems must be able to work
> after this feature is added.

I wouldn't say I'm concerned, it's more of an observation.

To me it seemed like we are combining two very different
approaches/interfaces in the same kernel module / userspace
library. This doesn't result in a compatibility problem, but it seems to
me that we could cleanly split this into two different components (that
may share code) with almost no API overlap.

But it seems I may have misunderstood some aspects about how the
fallback works. Let's wait for the FUSE-BPF patches and then revisit the
question :-).



Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2023-05-16 10:16 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-25 15:30 [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Alessio Balsini
2021-01-25 15:30 ` [PATCH RESEND V12 1/8] fs: Generic function to convert iocb to rw flags Alessio Balsini
2021-01-25 16:46   ` Alessio Balsini
2021-03-24  7:43     ` Rokudo Yan
2021-03-24 14:02       ` Alessio Balsini
2021-01-25 15:30 ` [PATCH RESEND V12 2/8] fuse: 32-bit user space ioctl compat for fuse device Alessio Balsini
     [not found]   ` <CAMAHBGzkfEd9-1u0iKXp65ReJQgUi_=4sMpmfkwEOaMp6Ux7pg@mail.gmail.com>
2021-01-27 13:40     ` Alessio Balsini
     [not found]       ` <CAMAHBGwpKW+30kNQ_Apt8A-FTmr94hBOzkT21cjEHHW+t7yUMQ@mail.gmail.com>
2021-01-28 14:15         ` Alessio Balsini
2021-02-05  9:54           ` Peng Tao
2021-03-16 18:57           ` Arnd Bergmann
2021-02-17 10:21   ` Miklos Szeredi
2021-03-01 12:26     ` Alessio Balsini
2021-03-16 18:53   ` Arnd Bergmann
2021-03-18 16:13     ` Alessio Balsini
2021-03-18 21:15       ` Arnd Bergmann
2021-03-19 15:21         ` Alessio Balsini
2021-01-25 15:30 ` [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough Alessio Balsini
2021-02-17 13:41   ` Miklos Szeredi
2021-02-19  7:05     ` Peng Tao
2021-02-19  8:40       ` Miklos Szeredi
2021-03-01 17:05         ` Alessio Balsini
2022-09-08 15:36           ` Amir Goldstein
2022-09-09 19:07             ` Miklos Szeredi
2022-09-10  8:52               ` Amir Goldstein
2022-09-10 13:03                 ` Bernd Schubert
2022-09-12  9:29                 ` Miklos Szeredi
2022-09-12 12:29                   ` Amir Goldstein
2022-09-12 13:03                     ` Miklos Szeredi
2022-09-12 13:05                       ` Miklos Szeredi
2022-09-12 13:26                       ` Amir Goldstein
2022-09-12 14:22                         ` Miklos Szeredi
2022-09-12 15:39                           ` Amir Goldstein
2022-09-12 17:43                             ` Hao Luo
2022-09-12 18:28                               ` Overlayfs with writable lower layer Amir Goldstein
2022-09-13 18:26                                 ` Hao Luo
2022-09-13 18:54                                   ` Amir Goldstein
2022-09-13 20:33                                     ` Hao Luo
2022-09-14  3:46                                       ` Amir Goldstein
2022-09-14 18:00                                         ` Hao Luo
2022-09-14 19:23                                           ` Amir Goldstein
2022-09-14 19:33                                             ` Hao Luo
2022-09-15 10:54                                               ` Amir Goldstein
2023-05-12 19:37                     ` [PATCH RESEND V12 3/8] fuse: Definitions and ioctl for passthrough Amir Goldstein
2023-05-15  7:29                       ` Miklos Szeredi
2023-05-15 14:00                         ` Amir Goldstein
2023-05-15 20:16                           ` [fuse-devel] " Nikolaus Rath
2023-05-15 21:11                             ` Bernd Schubert
2023-05-15 21:45                               ` Paul Lawrence
2023-05-16  8:43                                 ` Miklos Szeredi
2023-05-16 10:16                                   ` Nikolaus Rath
2023-05-16  8:48                                 ` Amir Goldstein
2021-01-25 15:30 ` [PATCH RESEND V12 4/8] fuse: Passthrough initialization and release Alessio Balsini
2021-02-17 13:52   ` Miklos Szeredi
2021-05-05 12:21     ` Amir Goldstein
2021-05-17 11:36       ` Alessio Balsini
2021-05-17 13:21         ` Amir Goldstein
2021-01-25 15:30 ` [PATCH RESEND V12 5/8] fuse: Introduce synchronous read and write for passthrough Alessio Balsini
2021-02-17 14:00   ` Miklos Szeredi
2021-01-25 15:30 ` [PATCH RESEND V12 6/8] fuse: Handle asynchronous read and write in passthrough Alessio Balsini
2021-01-25 15:30 ` [PATCH RESEND V12 7/8] fuse: Use daemon creds in passthrough mode Alessio Balsini
2021-02-05  9:23   ` Peng Tao
2021-02-05 11:21     ` Alessio Balsini
2021-01-25 15:30 ` [PATCH RESEND V12 8/8] fuse: Introduce passthrough for mmap Alessio Balsini
2021-02-17 14:05   ` Miklos Szeredi
2021-04-01 11:24     ` Alessio Balsini
2021-11-18 18:31 ` [PATCH RESEND V12 0/8] fuse: Add support for passthrough read/write Amir Goldstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.