($INBOX_DIR/description missing)
 help / color / Atom feed
* [PATCH v3 0/6] Overlayfs stacked f_op fixes
@ 2018-08-27 12:55 Amir Goldstein
  2018-08-27 12:55 ` [PATCH v3 1/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Amir Goldstein @ 2018-08-27 12:55 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Miklos,

Following are fixes to some stacked f_op regressions, also
available on my ovl-fixes branch [1].

Patches 1-2 fix fiemap() and swapon() regressions detected
by xfstests (no change from v2).

Patches 3-6 implement the fadvise() file operation in overlayfs
to fix fadvise64(2) and readahead(2) syscalls, per your suggestion.

Correctness and error cases of readahead(2) were verified with the LTP
readahead tests including a new overlayfs readahead test [2].
Error cases of fadvise64(2) were verified with the LTP fadvise tests,
but correctness is not yet covered by those tests.

Thanks,
Amir.

Changes from v2:
- avoid file_real() vfs hack
- implement ovl_fadvise()
- update vfs.txt

[1] https://github.com/amir73il/linux/commits/ovl-fixes
[2] https://github.com/amir73il/ltp/commits/overlayfs-devel

Amir Goldstein (6):
  ovl: respect FIEMAP_FLAG_SYNC flag
  ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  Documentation/filesystems: update documentation of file_operations to
    kernel 4.18
  vfs: add the fadvise() file operation
  vfs: implement readahead(2) using POSIX_FADV_WILLNEED
  ovl: add ovl_fadvise()

 Documentation/filesystems/vfs.txt | 25 +++++++++++--
 fs/overlayfs/file.c               | 21 +++++++++--
 fs/overlayfs/inode.c              | 10 +++++
 include/linux/fs.h                |  5 +++
 mm/fadvise.c                      | 78 ++++++++++++++++++++++-----------------
 mm/readahead.c                    | 28 ++++++--------
 6 files changed, 110 insertions(+), 57 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 1/6] ovl: respect FIEMAP_FLAG_SYNC flag
  2018-08-27 12:55 [PATCH v3 0/6] Overlayfs stacked f_op fixes Amir Goldstein
@ 2018-08-27 12:55 ` Amir Goldstein
  2018-08-27 12:56 ` [PATCH v3 2/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs Amir Goldstein
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2018-08-27 12:55 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Stacked overlayfs fiemap operation broke xfstests that test delayed
allocation (with "_test_generic_punch -d"), because ovl_fiemap()
failed to write dirty pages when requested.

Fixes: 9e142c4102db ("ovl: add ovl_fiemap()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/inode.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index e0bb217c01e2..5014749fd4b4 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -467,6 +467,10 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		return -EOPNOTSUPP;
 
 	old_cred = ovl_override_creds(inode->i_sb);
+
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC)
+		filemap_write_and_wait(realinode->i_mapping);
+
 	err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
 	revert_creds(old_cred);
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 2/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  2018-08-27 12:55 [PATCH v3 0/6] Overlayfs stacked f_op fixes Amir Goldstein
  2018-08-27 12:55 ` [PATCH v3 1/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
@ 2018-08-27 12:56 ` Amir Goldstein
  2018-08-27 12:56 ` [PATCH v3 3/6] Documentation/filesystems: update documentation of file_operations to kernel 4.18 Amir Goldstein
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2018-08-27 12:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Since overlayfs implements stacked file operations, the underlying
filesystems are not supposed to be exposed to the overlayfs file,
whose f_inode is an overlayfs inode.

Assigning an overlayfs file to swap_file results in an attempt of xfs
code to dereference an xfs_inode struct from an ovl_inode pointer:

 CPU: 0 PID: 2462 Comm: swapon Not tainted
 4.18.0-xfstests-12721-g33e17876ea4e #3402
 RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
 Call Trace:
  xfs_iomap_swapfile_activate+0x1f/0x43
  __se_sys_swapon+0xb1a/0xee9

Fix this by not assigning the real inode mapping to f_mapping, which
will cause swapon() to return an error (-EINVAL). Although it makes
sense not to allow setting swpafile on an overlayfs file, some users
may depend on it, so we may need to fix this up in the future.

Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
open to fail. Fix this by installing ovl_aops with noop_direct_IO in
overlay inode mapping.

Keeping f_mapping pointing to overlay inode mapping will cause other
a_ops related operations to fail (e.g. readahead()). Those will be
fixed by follow up patches.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: f7c72396d0de ("ovl: add O_DIRECT support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/file.c  | 3 ---
 fs/overlayfs/inode.c | 6 ++++++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 32e9282893c9..a4acd84591d4 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -131,9 +131,6 @@ static int ovl_open(struct inode *inode, struct file *file)
 	if (IS_ERR(realfile))
 		return PTR_ERR(realfile);
 
-	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	file->f_mapping = realfile->f_mapping;
-
 	file->private_data = realfile;
 
 	return 0;
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 5014749fd4b4..b6ac545b5a32 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -504,6 +504,11 @@ static const struct inode_operations ovl_special_inode_operations = {
 	.update_time	= ovl_update_time,
 };
 
+const struct address_space_operations ovl_aops = {
+	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
+	.direct_IO		= noop_direct_IO,
+};
+
 /*
  * It is possible to stack overlayfs instance on top of another
  * overlayfs instance as lower layer. We need to annonate the
@@ -575,6 +580,7 @@ static void ovl_fill_inode(struct inode *inode, umode_t mode, dev_t rdev,
 	case S_IFREG:
 		inode->i_op = &ovl_file_inode_operations;
 		inode->i_fop = &ovl_file_operations;
+		inode->i_mapping->a_ops = &ovl_aops;
 		break;
 
 	case S_IFDIR:
-- 
2.7.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 3/6] Documentation/filesystems: update documentation of file_operations to kernel 4.18
  2018-08-27 12:55 [PATCH v3 0/6] Overlayfs stacked f_op fixes Amir Goldstein
  2018-08-27 12:55 ` [PATCH v3 1/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
  2018-08-27 12:56 ` [PATCH v3 2/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs Amir Goldstein
@ 2018-08-27 12:56 ` Amir Goldstein
  2018-08-27 12:56 ` [PATCH v3 4/6] vfs: add the fadvise() file operation Amir Goldstein
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2018-08-27 12:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 Documentation/filesystems/vfs.txt | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 4b2084d0f1fb..e0ace944a0e1 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -848,7 +848,7 @@ struct file_operations
 ----------------------
 
 This describes how the VFS can manipulate an open file. As of kernel
-4.1, the following members are defined:
+4.18, the following members are defined:
 
 struct file_operations {
 	struct module *owner;
@@ -858,11 +858,11 @@ struct file_operations {
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
 	int (*iterate) (struct file *, struct dir_context *);
+	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
-	int (*mremap)(struct file *, struct vm_area_struct *);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
@@ -882,6 +882,9 @@ struct file_operations {
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
 #endif
+	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
+	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -899,6 +902,9 @@ otherwise noted.
 
   iterate: called when the VFS needs to read the directory contents
 
+  iterate_shared: called when the VFS needs to read the directory contents
+	when filesystem supports concurrent dir iterators
+
   poll: called by the VFS when a process wants to check if there is
 	activity on this file and (optionally) go to sleep until there
 	is activity. Called by the select(2) and poll(2) system calls
@@ -906,7 +912,7 @@ otherwise noted.
   unlocked_ioctl: called by the ioctl(2) system call.
 
   compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
- 	 are used on 64 bit kernels.
+	 are used on 64 bit kernels.
 
   mmap: called by the mmap(2) system call
 
@@ -931,7 +937,7 @@ otherwise noted.
 	(non-blocking) mode is enabled for a file
 
   lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
-  	commands
+	commands
 
   get_unmapped_area: called by the mmap(2) system call
 
@@ -951,6 +957,14 @@ otherwise noted.
 
   fallocate: called by the VFS to preallocate blocks or punch a hole.
 
+  copy_file_range: called by the copy_file_range(2) system call.
+
+  clone_file_range: called by the ioctl(2) system call for FICLONERANGE and
+	FICLONE commands.
+
+  dedupe_file_range: called by the ioctl(2) system call for FIDEDUPERANGE
+	command.
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special
-- 
2.7.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 4/6] vfs: add the fadvise() file operation
  2018-08-27 12:55 [PATCH v3 0/6] Overlayfs stacked f_op fixes Amir Goldstein
                   ` (2 preceding siblings ...)
  2018-08-27 12:56 ` [PATCH v3 3/6] Documentation/filesystems: update documentation of file_operations to kernel 4.18 Amir Goldstein
@ 2018-08-27 12:56 ` Amir Goldstein
  2018-08-27 13:56   ` Steve French
  2018-08-27 12:56 ` [PATCH v3 5/6] vfs: implement readahead(2) using POSIX_FADV_WILLNEED Amir Goldstein
  2018-08-27 12:56 ` [PATCH v3 6/6] ovl: add ovl_fadvise() Amir Goldstein
  5 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2018-08-27 12:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

This is going to be used by overlayfs and possibly useful
for other filesystems.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 Documentation/filesystems/vfs.txt |  3 ++
 include/linux/fs.h                |  5 +++
 mm/fadvise.c                      | 78 ++++++++++++++++++++++-----------------
 3 files changed, 53 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index e0ace944a0e1..4868fa9c0758 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -885,6 +885,7 @@ struct file_operations {
 	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
 	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
 	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
+	int (*fadvise)(struct file *, loff_t, loff_t, int);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -965,6 +966,8 @@ otherwise noted.
   dedupe_file_range: called by the ioctl(2) system call for FIDEDUPERANGE
 	command.
 
+  fadvise: possibly called by the fadvise64() system call.
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 33322702c910..6c0b4a1c22ff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1763,6 +1763,7 @@ struct file_operations {
 			u64);
 	int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
 			u64);
+	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
 struct inode_operations {
@@ -3459,4 +3460,8 @@ static inline bool dir_relax_shared(struct inode *inode)
 extern bool path_noexec(const struct path *path);
 extern void inode_nohighmem(struct inode *inode);
 
+/* mm/fadvise.c */
+extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
+		       int advice);
+
 #endif /* _LINUX_FS_H */
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 2d8376e3c640..2f59bac1cb77 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -27,9 +27,9 @@
  * deactivate the pages and clear PG_Referenced.
  */
 
-int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
+static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
+			   int advice)
 {
-	struct fd f = fdget(fd);
 	struct inode *inode;
 	struct address_space *mapping;
 	struct backing_dev_info *bdi;
@@ -37,22 +37,14 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 	pgoff_t start_index;
 	pgoff_t end_index;
 	unsigned long nrpages;
-	int ret = 0;
 
-	if (!f.file)
-		return -EBADF;
+	inode = file_inode(file);
+	if (S_ISFIFO(inode->i_mode))
+		return -ESPIPE;
 
-	inode = file_inode(f.file);
-	if (S_ISFIFO(inode->i_mode)) {
-		ret = -ESPIPE;
-		goto out;
-	}
-
-	mapping = f.file->f_mapping;
-	if (!mapping || len < 0) {
-		ret = -EINVAL;
-		goto out;
-	}
+	mapping = file->f_mapping;
+	if (!mapping || len < 0)
+		return -EINVAL;
 
 	bdi = inode_to_bdi(mapping->host);
 
@@ -67,9 +59,9 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			ret = -EINVAL;
+			return -EINVAL;
 		}
-		goto out;
+		return 0;
 	}
 
 	/*
@@ -85,21 +77,21 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 
 	switch (advice) {
 	case POSIX_FADV_NORMAL:
-		f.file->f_ra.ra_pages = bdi->ra_pages;
-		spin_lock(&f.file->f_lock);
-		f.file->f_mode &= ~FMODE_RANDOM;
-		spin_unlock(&f.file->f_lock);
+		file->f_ra.ra_pages = bdi->ra_pages;
+		spin_lock(&file->f_lock);
+		file->f_mode &= ~FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_RANDOM:
-		spin_lock(&f.file->f_lock);
-		f.file->f_mode |= FMODE_RANDOM;
-		spin_unlock(&f.file->f_lock);
+		spin_lock(&file->f_lock);
+		file->f_mode |= FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_SEQUENTIAL:
-		f.file->f_ra.ra_pages = bdi->ra_pages * 2;
-		spin_lock(&f.file->f_lock);
-		f.file->f_mode &= ~FMODE_RANDOM;
-		spin_unlock(&f.file->f_lock);
+		file->f_ra.ra_pages = bdi->ra_pages * 2;
+		spin_lock(&file->f_lock);
+		file->f_mode &= ~FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_WILLNEED:
 		/* First and last PARTIAL page! */
@@ -115,8 +107,7 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 		 * Ignore return value because fadvise() shall return
 		 * success even if filesystem can't retrieve a hint,
 		 */
-		force_page_cache_readahead(mapping, f.file, start_index,
-					   nrpages);
+		force_page_cache_readahead(mapping, file, start_index, nrpages);
 		break;
 	case POSIX_FADV_NOREUSE:
 		break;
@@ -183,9 +174,30 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 		}
 		break;
 	default:
-		ret = -EINVAL;
+		return -EINVAL;
 	}
-out:
+	return 0;
+}
+
+int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+	if (file->f_op->fadvise)
+		return file->f_op->fadvise(file, offset, len, advice);
+
+	return generic_fadvise(file, offset, len, advice);
+}
+EXPORT_SYMBOL(vfs_fadvise);
+
+int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
+{
+	struct fd f = fdget(fd);
+	int ret;
+
+	if (!f.file)
+		return -EBADF;
+
+	ret = vfs_fadvise(f.file, offset, len, advice);
+
 	fdput(f);
 	return ret;
 }
-- 
2.7.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 5/6] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
  2018-08-27 12:55 [PATCH v3 0/6] Overlayfs stacked f_op fixes Amir Goldstein
                   ` (3 preceding siblings ...)
  2018-08-27 12:56 ` [PATCH v3 4/6] vfs: add the fadvise() file operation Amir Goldstein
@ 2018-08-27 12:56 ` Amir Goldstein
  2018-08-27 23:30   ` Dave Chinner
  2018-08-27 12:56 ` [PATCH v3 6/6] ovl: add ovl_fadvise() Amir Goldstein
  5 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2018-08-27 12:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

The implementation of readahead(2) syscall is identical to that of
fadvise64(POSIX_FADV_WILLNEED) with a few exceptions:
1. readahead(2) returns -EINVAL for !mapping->a_ops and fadvise64()
   ignores the request and returns 0.
2. fadvise64() checks for integer overflow corner case
3. fadvise64() calls the optional filesystem fadvice() file operation

Unite the two implementations by calling vfs_fadvice() from readahead(2)
syscall. Check the !mapping->a_ops in readahead(2) syscall to preserve
legacy behavior.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d1d04ef8572b ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 mm/readahead.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index a59ea70527b9..867a5cc3a62e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -20,6 +20,7 @@
 #include <linux/file.h>
 #include <linux/mm_inline.h>
 #include <linux/blk-cgroup.h>
+#include <linux/fadvise.h>
 
 #include "internal.h"
 
@@ -576,21 +577,19 @@ page_cache_async_readahead(struct address_space *mapping,
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
 
 static ssize_t
-do_readahead(struct address_space *mapping, struct file *filp,
-	     pgoff_t index, unsigned long nr)
+do_readahead(struct file *file, loff_t offset, size_t count)
 {
-	if (!mapping || !mapping->a_ops)
-		return -EINVAL;
+	struct address_space *mapping = file->f_mapping;
 
 	/*
-	 * Readahead doesn't make sense for DAX inodes, but we don't want it
-	 * to report a failure either.  Instead, we just return success and
-	 * don't do any work.
+	 * fadvise() silently ignores an advice for a file with !a_ops and
+	 * returns -EPIPE for a pipe. Keep this check here to comply with legacy
+	 * -EINVAL behavior of readahead(2).
 	 */
-	if (dax_mapping(mapping))
-		return 0;
+	if (!mapping || !mapping->a_ops || !S_ISREG(file_inode(file)->i_mode))
+		return -EINVAL;
 
-	return force_page_cache_readahead(mapping, filp, index, nr);
+	return vfs_fadvise(file, offset, count, POSIX_FADV_WILLNEED);
 }
 
 ssize_t ksys_readahead(int fd, loff_t offset, size_t count)
@@ -601,13 +600,8 @@ ssize_t ksys_readahead(int fd, loff_t offset, size_t count)
 	ret = -EBADF;
 	f = fdget(fd);
 	if (f.file) {
-		if (f.file->f_mode & FMODE_READ) {
-			struct address_space *mapping = f.file->f_mapping;
-			pgoff_t start = offset >> PAGE_SHIFT;
-			pgoff_t end = (offset + count - 1) >> PAGE_SHIFT;
-			unsigned long len = end - start + 1;
-			ret = do_readahead(mapping, f.file, start, len);
-		}
+		if (f.file->f_mode & FMODE_READ)
+			ret = do_readahead(f.file, offset, count);
 		fdput(f);
 	}
 	return ret;
-- 
2.7.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 6/6] ovl: add ovl_fadvise()
  2018-08-27 12:55 [PATCH v3 0/6] Overlayfs stacked f_op fixes Amir Goldstein
                   ` (4 preceding siblings ...)
  2018-08-27 12:56 ` [PATCH v3 5/6] vfs: implement readahead(2) using POSIX_FADV_WILLNEED Amir Goldstein
@ 2018-08-27 12:56 ` Amir Goldstein
  2018-08-27 18:52   ` Vivek Goyal
  2019-12-28  5:49   ` Andreas Dilger
  5 siblings, 2 replies; 14+ messages in thread
From: Amir Goldstein @ 2018-08-27 12:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
on an overlayfs file.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d1d04ef8572b ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/file.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index a4acd84591d4..42d2d034d85c 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -331,6 +331,23 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
 	return ret;
 }
 
+int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+	struct fd real;
+	int ret;
+
+	ret = ovl_real_fdget(file, &real);
+	if (ret)
+		return ret;
+
+	/* XXX: do we need mounter credentials? */
+	ret = vfs_fadvise(real.file, offset, len, advice);
+
+	fdput(real);
+
+	return ret;
+}
+
 static long ovl_real_ioctl(struct file *file, unsigned int cmd,
 			   unsigned long arg)
 {
@@ -499,6 +516,7 @@ const struct file_operations ovl_file_operations = {
 	.fsync		= ovl_fsync,
 	.mmap		= ovl_mmap,
 	.fallocate	= ovl_fallocate,
+	.fadvise	= ovl_fadvise,
 	.unlocked_ioctl	= ovl_ioctl,
 	.compat_ioctl	= ovl_compat_ioctl,
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 4/6] vfs: add the fadvise() file operation
  2018-08-27 12:56 ` [PATCH v3 4/6] vfs: add the fadvise() file operation Amir Goldstein
@ 2018-08-27 13:56   ` Steve French
  0 siblings, 0 replies; 14+ messages in thread
From: Steve French @ 2018-08-27 13:56 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

This reminds me of the relatively recent addition of two flags on
write (and one on read) to Windows.  Any thoughts on whether this
could be used to solve one of the problems virtualization engines
(apparently) ran into running over the network and solved on other
platforms by adding per-write flags to change whether the write is
write-through or not and whether the write should be unbuffered (also
allowed for read requests).  See MS-SMB2 section 2.2.21.
This was added to the network file system protocol fairly recently
(dialect 3.02) but I can see how it could be useful for a few
application use case and seems related to the fadvise api, although
perhaps easier to use.  Maybe fadvise could be used by cifs.ko to
properly set these flags on the write (and read) protocol ops.
Thoughts?

Any obvious ways that they could be mapped together.
On Mon, Aug 27, 2018 at 7:55 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> This is going to be used by overlayfs and possibly useful
> for other filesystems.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  Documentation/filesystems/vfs.txt |  3 ++
>  include/linux/fs.h                |  5 +++
>  mm/fadvise.c                      | 78 ++++++++++++++++++++++-----------------
>  3 files changed, 53 insertions(+), 33 deletions(-)
>
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index e0ace944a0e1..4868fa9c0758 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -885,6 +885,7 @@ struct file_operations {
>         ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
>         int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
>         int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> +       int (*fadvise)(struct file *, loff_t, loff_t, int);
>  };
>
>  Again, all methods are called without any locks being held, unless
> @@ -965,6 +966,8 @@ otherwise noted.
>    dedupe_file_range: called by the ioctl(2) system call for FIDEDUPERANGE
>         command.
>
> +  fadvise: possibly called by the fadvise64() system call.
> +
>  Note that the file operations are implemented by the specific
>  filesystem in which the inode resides. When opening a device node
>  (character or block special) most filesystems will call special
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 33322702c910..6c0b4a1c22ff 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1763,6 +1763,7 @@ struct file_operations {
>                         u64);
>         int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t,
>                         u64);
> +       int (*fadvise)(struct file *, loff_t, loff_t, int);
>  } __randomize_layout;
>
>  struct inode_operations {
> @@ -3459,4 +3460,8 @@ static inline bool dir_relax_shared(struct inode *inode)
>  extern bool path_noexec(const struct path *path);
>  extern void inode_nohighmem(struct inode *inode);
>
> +/* mm/fadvise.c */
> +extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
> +                      int advice);
> +
>  #endif /* _LINUX_FS_H */
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 2d8376e3c640..2f59bac1cb77 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -27,9 +27,9 @@
>   * deactivate the pages and clear PG_Referenced.
>   */
>
> -int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
> +static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
> +                          int advice)
>  {
> -       struct fd f = fdget(fd);
>         struct inode *inode;
>         struct address_space *mapping;
>         struct backing_dev_info *bdi;
> @@ -37,22 +37,14 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>         pgoff_t start_index;
>         pgoff_t end_index;
>         unsigned long nrpages;
> -       int ret = 0;
>
> -       if (!f.file)
> -               return -EBADF;
> +       inode = file_inode(file);
> +       if (S_ISFIFO(inode->i_mode))
> +               return -ESPIPE;
>
> -       inode = file_inode(f.file);
> -       if (S_ISFIFO(inode->i_mode)) {
> -               ret = -ESPIPE;
> -               goto out;
> -       }
> -
> -       mapping = f.file->f_mapping;
> -       if (!mapping || len < 0) {
> -               ret = -EINVAL;
> -               goto out;
> -       }
> +       mapping = file->f_mapping;
> +       if (!mapping || len < 0)
> +               return -EINVAL;
>
>         bdi = inode_to_bdi(mapping->host);
>
> @@ -67,9 +59,9 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>                         /* no bad return value, but ignore advice */
>                         break;
>                 default:
> -                       ret = -EINVAL;
> +                       return -EINVAL;
>                 }
> -               goto out;
> +               return 0;
>         }
>
>         /*
> @@ -85,21 +77,21 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>
>         switch (advice) {
>         case POSIX_FADV_NORMAL:
> -               f.file->f_ra.ra_pages = bdi->ra_pages;
> -               spin_lock(&f.file->f_lock);
> -               f.file->f_mode &= ~FMODE_RANDOM;
> -               spin_unlock(&f.file->f_lock);
> +               file->f_ra.ra_pages = bdi->ra_pages;
> +               spin_lock(&file->f_lock);
> +               file->f_mode &= ~FMODE_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_RANDOM:
> -               spin_lock(&f.file->f_lock);
> -               f.file->f_mode |= FMODE_RANDOM;
> -               spin_unlock(&f.file->f_lock);
> +               spin_lock(&file->f_lock);
> +               file->f_mode |= FMODE_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_SEQUENTIAL:
> -               f.file->f_ra.ra_pages = bdi->ra_pages * 2;
> -               spin_lock(&f.file->f_lock);
> -               f.file->f_mode &= ~FMODE_RANDOM;
> -               spin_unlock(&f.file->f_lock);
> +               file->f_ra.ra_pages = bdi->ra_pages * 2;
> +               spin_lock(&file->f_lock);
> +               file->f_mode &= ~FMODE_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_WILLNEED:
>                 /* First and last PARTIAL page! */
> @@ -115,8 +107,7 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>                  * Ignore return value because fadvise() shall return
>                  * success even if filesystem can't retrieve a hint,
>                  */
> -               force_page_cache_readahead(mapping, f.file, start_index,
> -                                          nrpages);
> +               force_page_cache_readahead(mapping, file, start_index, nrpages);
>                 break;
>         case POSIX_FADV_NOREUSE:
>                 break;
> @@ -183,9 +174,30 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>                 }
>                 break;
>         default:
> -               ret = -EINVAL;
> +               return -EINVAL;
>         }
> -out:
> +       return 0;
> +}
> +
> +int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> +{
> +       if (file->f_op->fadvise)
> +               return file->f_op->fadvise(file, offset, len, advice);
> +
> +       return generic_fadvise(file, offset, len, advice);
> +}
> +EXPORT_SYMBOL(vfs_fadvise);
> +
> +int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
> +{
> +       struct fd f = fdget(fd);
> +       int ret;
> +
> +       if (!f.file)
> +               return -EBADF;
> +
> +       ret = vfs_fadvise(f.file, offset, len, advice);
> +
>         fdput(f);
>         return ret;
>  }
> --
> 2.7.4
>


-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 6/6] ovl: add ovl_fadvise()
  2018-08-27 12:56 ` [PATCH v3 6/6] ovl: add ovl_fadvise() Amir Goldstein
@ 2018-08-27 18:52   ` Vivek Goyal
  2018-08-27 19:05     ` Amir Goldstein
  2019-12-28  5:49   ` Andreas Dilger
  1 sibling, 1 reply; 14+ messages in thread
From: Vivek Goyal @ 2018-08-27 18:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

On Mon, Aug 27, 2018 at 03:56:04PM +0300, Amir Goldstein wrote:
> Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
> on an overlayfs file.
> 
> Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
> Fixes: d1d04ef8572b ("ovl: stack file ops")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/overlayfs/file.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index a4acd84591d4..42d2d034d85c 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -331,6 +331,23 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
>  	return ret;
>  }
>  
> +int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> +{
> +	struct fd real;
> +	int ret;
> +
> +	ret = ovl_real_fdget(file, &real);
> +	if (ret)
> +		return ret;
> +
> +	/* XXX: do we need mounter credentials? */

Given we are switching creds to mounter for rest of the file operations,
so I would think we need to do it here as well to be consistent with
this security model.

Thanks
Vivek
> +	ret = vfs_fadvise(real.file, offset, len, advice);
> +
> +	fdput(real);
> +
> +	return ret;
> +}
> +
>  static long ovl_real_ioctl(struct file *file, unsigned int cmd,
>  			   unsigned long arg)
>  {
> @@ -499,6 +516,7 @@ const struct file_operations ovl_file_operations = {
>  	.fsync		= ovl_fsync,
>  	.mmap		= ovl_mmap,
>  	.fallocate	= ovl_fallocate,
> +	.fadvise	= ovl_fadvise,
>  	.unlocked_ioctl	= ovl_ioctl,
>  	.compat_ioctl	= ovl_compat_ioctl,
>  
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 6/6] ovl: add ovl_fadvise()
  2018-08-27 18:52   ` Vivek Goyal
@ 2018-08-27 19:05     ` Amir Goldstein
  2018-08-27 19:29       ` Miklos Szeredi
  0 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2018-08-27 19:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miklos Szeredi, Al Viro, Dave Chinner, overlayfs, linux-fsdevel

On Mon, Aug 27, 2018 at 9:52 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Aug 27, 2018 at 03:56:04PM +0300, Amir Goldstein wrote:
> > Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
> > on an overlayfs file.
> >
> > Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
> > Fixes: d1d04ef8572b ("ovl: stack file ops")
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  fs/overlayfs/file.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index a4acd84591d4..42d2d034d85c 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -331,6 +331,23 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
> >       return ret;
> >  }
> >
> > +int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> > +{
> > +     struct fd real;
> > +     int ret;
> > +
> > +     ret = ovl_real_fdget(file, &real);
> > +     if (ret)
> > +             return ret;
> > +
> > +     /* XXX: do we need mounter credentials? */
>
> Given we are switching creds to mounter for rest of the file operations,
> so I would think we need to do it here as well to be consistent with
> this security model.
>

Yeh, I guess so, although I did not see any security checks in
fadvise64(2) syscall. readahead(2) at least checks for FMODE_READ
on the open file fadvise doesn't even bother with that...

Miklos, let me know if you want me to add override_creds or will
you add it yourself.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 6/6] ovl: add ovl_fadvise()
  2018-08-27 19:05     ` Amir Goldstein
@ 2018-08-27 19:29       ` Miklos Szeredi
  0 siblings, 0 replies; 14+ messages in thread
From: Miklos Szeredi @ 2018-08-27 19:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Vivek Goyal, Al Viro, Dave Chinner, overlayfs, linux-fsdevel

On Mon, Aug 27, 2018 at 9:05 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Aug 27, 2018 at 9:52 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>>
>> On Mon, Aug 27, 2018 at 03:56:04PM +0300, Amir Goldstein wrote:
>> > Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
>> > on an overlayfs file.
>> >
>> > Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
>> > Fixes: d1d04ef8572b ("ovl: stack file ops")
>> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> > ---
>> >  fs/overlayfs/file.c | 18 ++++++++++++++++++
>> >  1 file changed, 18 insertions(+)
>> >
>> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> > index a4acd84591d4..42d2d034d85c 100644
>> > --- a/fs/overlayfs/file.c
>> > +++ b/fs/overlayfs/file.c
>> > @@ -331,6 +331,23 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
>> >       return ret;
>> >  }
>> >
>> > +int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>> > +{
>> > +     struct fd real;
>> > +     int ret;
>> > +
>> > +     ret = ovl_real_fdget(file, &real);
>> > +     if (ret)
>> > +             return ret;
>> > +
>> > +     /* XXX: do we need mounter credentials? */
>>
>> Given we are switching creds to mounter for rest of the file operations,
>> so I would think we need to do it here as well to be consistent with
>> this security model.
>>
>
> Yeh, I guess so, although I did not see any security checks in
> fadvise64(2) syscall. readahead(2) at least checks for FMODE_READ
> on the open file fadvise doesn't even bother with that...
>
> Miklos, let me know if you want me to add override_creds or will
> you add it yourself.

I'll add it.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 5/6] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
  2018-08-27 12:56 ` [PATCH v3 5/6] vfs: implement readahead(2) using POSIX_FADV_WILLNEED Amir Goldstein
@ 2018-08-27 23:30   ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2018-08-27 23:30 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Al Viro, linux-unionfs, linux-fsdevel

On Mon, Aug 27, 2018 at 03:56:03PM +0300, Amir Goldstein wrote:
> The implementation of readahead(2) syscall is identical to that of
> fadvise64(POSIX_FADV_WILLNEED) with a few exceptions:
> 1. readahead(2) returns -EINVAL for !mapping->a_ops and fadvise64()
>    ignores the request and returns 0.
> 2. fadvise64() checks for integer overflow corner case
> 3. fadvise64() calls the optional filesystem fadvice() file operation
> 
> Unite the two implementations by calling vfs_fadvice() from readahead(2)
> syscall. Check the !mapping->a_ops in readahead(2) syscall to preserve
> legacy behavior.
> 
> Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
> Fixes: d1d04ef8572b ("ovl: stack file ops")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  mm/readahead.c | 28 +++++++++++-----------------
>  1 file changed, 11 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index a59ea70527b9..867a5cc3a62e 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -20,6 +20,7 @@
>  #include <linux/file.h>
>  #include <linux/mm_inline.h>
>  #include <linux/blk-cgroup.h>
> +#include <linux/fadvise.h>
>  
>  #include "internal.h"
>  
> @@ -576,21 +577,19 @@ page_cache_async_readahead(struct address_space *mapping,
>  EXPORT_SYMBOL_GPL(page_cache_async_readahead);
>  
>  static ssize_t
> -do_readahead(struct address_space *mapping, struct file *filp,
> -	     pgoff_t index, unsigned long nr)
> +do_readahead(struct file *file, loff_t offset, size_t count)
>  {
> -	if (!mapping || !mapping->a_ops)
> -		return -EINVAL;
> +	struct address_space *mapping = file->f_mapping;
>  
>  	/*
> -	 * Readahead doesn't make sense for DAX inodes, but we don't want it
> -	 * to report a failure either.  Instead, we just return success and
> -	 * don't do any work.
> +	 * fadvise() silently ignores an advice for a file with !a_ops and
> +	 * returns -EPIPE for a pipe. Keep this check here to comply with legacy
> +	 * -EINVAL behavior of readahead(2).
>  	 */
> -	if (dax_mapping(mapping))
> -		return 0;
> +	if (!mapping || !mapping->a_ops || !S_ISREG(file_inode(file)->i_mode))
> +		return -EINVAL;
> -	return force_page_cache_readahead(mapping, filp, index, nr);
> +	return vfs_fadvise(file, offset, count, POSIX_FADV_WILLNEED);
>  }
>  
>  ssize_t ksys_readahead(int fd, loff_t offset, size_t count)
> @@ -601,13 +600,8 @@ ssize_t ksys_readahead(int fd, loff_t offset, size_t count)
>  	ret = -EBADF;
>  	f = fdget(fd);
>  	if (f.file) {
> -		if (f.file->f_mode & FMODE_READ) {
> -			struct address_space *mapping = f.file->f_mapping;
> -			pgoff_t start = offset >> PAGE_SHIFT;
> -			pgoff_t end = (offset + count - 1) >> PAGE_SHIFT;
> -			unsigned long len = end - start + 1;
> -			ret = do_readahead(mapping, f.file, start, len);
> -		}
> +		if (f.file->f_mode & FMODE_READ)
> +			ret = do_readahead(f.file, offset, count);
>  		fdput(f);

Can we just get rid of do_readahead helper and call vfs_fadvise()
directly? This code is not shared with anyone and the error
semantics are specific to the readahead syscall, so IMO those three
lines of code should just be in line....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 6/6] ovl: add ovl_fadvise()
  2018-08-27 12:56 ` [PATCH v3 6/6] ovl: add ovl_fadvise() Amir Goldstein
  2018-08-27 18:52   ` Vivek Goyal
@ 2019-12-28  5:49   ` Andreas Dilger
  2019-12-28 10:10     ` Amir Goldstein
  1 sibling, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2019-12-28  5:49 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1785 bytes --]

On Aug 27, 2018, at 6:56 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> 
> Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
> on an overlayfs file.

I was just looking into the existence of the "new" fadvise() method in
the VFS being able to communicate application hints directly to the
filesystem to see if it could be used to address the word size issue in
https://bugzilla.kernel.org/show_bug.cgi?id=205957 without adding a new
syscall, and came across this patch and the 4/6 patch that adds the
vfs_fadvise() function itself (copied below for clarity).

It seems to me that this implementation is broken?  Only vfs_fadvise()
is called from the fadvise64() syscall, and it will call f_op->fadvise()
if the filesystem provides this method.  Only overlayfs provides the
.fadvise method today.  However, it looks that ovl_fadvise() calls back
into vfs_fadvise() again, in a seemingly endless loop?

It seems like generic_fadvise() should be EXPORT_SYMBOL() so that any
filesystem that implements its own .fadvise method can do its own thing,
and then call generic_fadvise() to handle the remaining MM-specific work.

Thoughts?

> +int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> +{
> +	if (file->f_op->fadvise)
> +		return file->f_op->fadvise(file, offset, len, advice);
> +
> +	return generic_fadvise(file, offset, len, advice);
> +}
> +EXPORT_SYMBOL(vfs_fadvise);
> 
> +int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> +{
> +	struct fd real;
> +	int ret;
> +
> +	ret = ovl_real_fdget(file, &real);
> +	if (ret)
> +		return ret;
> +
> +	/* XXX: do we need mounter credentials? */
> +	ret = vfs_fadvise(real.file, offset, len, advice);
> +
> +	fdput(real);
> +
> +	return ret;
> +}

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 6/6] ovl: add ovl_fadvise()
  2019-12-28  5:49   ` Andreas Dilger
@ 2019-12-28 10:10     ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2019-12-28 10:10 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Miklos Szeredi, Al Viro, Dave Chinner, overlayfs, linux-fsdevel

On Sat, Dec 28, 2019 at 7:49 AM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Aug 27, 2018, at 6:56 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
> > on an overlayfs file.
>
> I was just looking into the existence of the "new" fadvise() method in
> the VFS being able to communicate application hints directly to the
> filesystem to see if it could be used to address the word size issue in
> https://bugzilla.kernel.org/show_bug.cgi?id=205957 without adding a new
> syscall, and came across this patch and the 4/6 patch that adds the
> vfs_fadvise() function itself (copied below for clarity).
>
> It seems to me that this implementation is broken?  Only vfs_fadvise()
> is called from the fadvise64() syscall, and it will call f_op->fadvise()
> if the filesystem provides this method.  Only overlayfs provides the
> .fadvise method today.  However, it looks that ovl_fadvise() calls back
> into vfs_fadvise() again, in a seemingly endless loop?
>

You are confusing endless loop with recursion that has a stop condition.
The entire concept of stacked filesystem is recursion back into vfs.
This is essentially what most of the ovl file operations do, but they recurse
on the "real.file", which is supposed to be on a filesystem with lower
sb->s_stack_depth (FILESYSTEM_MAX_STACK_DEPTH is 2).

> It seems like generic_fadvise() should be EXPORT_SYMBOL() so that any
> filesystem that implements its own .fadvise method can do its own thing,
> and then call generic_fadvise() to handle the remaining MM-specific work.
>
> Thoughts?

Sure makes sense.
Overlayfs just doesn't need to call the generic helper.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, back to index

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-27 12:55 [PATCH v3 0/6] Overlayfs stacked f_op fixes Amir Goldstein
2018-08-27 12:55 ` [PATCH v3 1/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
2018-08-27 12:56 ` [PATCH v3 2/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs Amir Goldstein
2018-08-27 12:56 ` [PATCH v3 3/6] Documentation/filesystems: update documentation of file_operations to kernel 4.18 Amir Goldstein
2018-08-27 12:56 ` [PATCH v3 4/6] vfs: add the fadvise() file operation Amir Goldstein
2018-08-27 13:56   ` Steve French
2018-08-27 12:56 ` [PATCH v3 5/6] vfs: implement readahead(2) using POSIX_FADV_WILLNEED Amir Goldstein
2018-08-27 23:30   ` Dave Chinner
2018-08-27 12:56 ` [PATCH v3 6/6] ovl: add ovl_fadvise() Amir Goldstein
2018-08-27 18:52   ` Vivek Goyal
2018-08-27 19:05     ` Amir Goldstein
2018-08-27 19:29       ` Miklos Szeredi
2019-12-28  5:49   ` Andreas Dilger
2019-12-28 10:10     ` Amir Goldstein

($INBOX_DIR/description missing)

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-unionfs/0 linux-unionfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-unionfs linux-unionfs/ https://lore.kernel.org/linux-unionfs \
		linux-unionfs@vger.kernel.org
	public-inbox-index linux-unionfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-unionfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git