linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] Overlayfs stacked f_op fixes
@ 2018-08-26 16:25 Amir Goldstein
  2018-08-26 16:25 ` [PATCH v2 1/6] vfs: add helper to get "real" overlayfs file Amir Goldstein
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 16:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Miklos,

Following are fixes to some stacked f_op regressions, also
availabel on my ovl-fixes branch [1].

Patches 2-3 fix fiemap() and swapon() regressions detected
by xfstests.

Per your suggestion, f_mapping now points to the overlay
inode mapping and as a result, swapfile support is disabled on
an overlayfs file.
Also disabled is FIBMAP on an overlayfs file - not a big loss.

Patch 1 adds the vfs helper file_real() to access overlayfs
real file and patches 4-6 use this helper to untangle from
the f_mapping change.

Please note that patch 4 fixes a problem that was verified
with an LTP test [2], but patches 5-6 fix theoretical problems
that I have not yet demonstrated with a test.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/ovl-fixes
[2] https://github.com/amir73il/ltp/commits/overlayfs-devel

Amir Goldstein (6):
  vfs: add helper to get "real" overlayfs file
  ovl: respect FIEMAP_FLAG_SYNC flag
  ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  vfs: fix readahead syscall on an overlayfs file
  vfs: fix fadvise64 syscall on an overlayfs file
  vfs: fix sync_file_range syscall on an overlayfs file

 fs/overlayfs/file.c  |  4 +---
 fs/overlayfs/inode.c | 10 ++++++++++
 fs/sync.c            | 14 ++++++++++----
 include/linux/fs.h   | 27 +++++++++++++++++++++++++++
 mm/fadvise.c         | 34 ++++++++++++++++++++--------------
 mm/readahead.c       |  9 +++++++--
 6 files changed, 75 insertions(+), 23 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/6] vfs: add helper to get "real" overlayfs file
  2018-08-26 16:25 [PATCH v2 0/6] Overlayfs stacked f_op fixes Amir Goldstein
@ 2018-08-26 16:25 ` Amir Goldstein
  2018-08-26 16:25 ` [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 16:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

We recently got rid of most VFS "hacks" that made VFS be aware of
overlayfs real underlying dentries. Due to collateral damage of
removing these hacks, we now need to make VFS be aware of overlayfs
real underlying files. Hopefully, the end result will be fewer places
where VFS is aware of overlayfs quirks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/file.c |  1 +
 include/linux/fs.h  | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 32e9282893c9..24c9e0d70c3b 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -135,6 +135,7 @@ static int ovl_open(struct inode *inode, struct file *file)
 	file->f_mapping = realfile->f_mapping;
 
 	file->private_data = realfile;
+	file->f_mode |= FMODE_STACKED;
 
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 33322702c910..51c5d167498f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -161,6 +161,12 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT	((__force fmode_t)0x20000000)
+/*
+ * File is stacked over a "real" file that is used for the actual io
+ * operations. This is set by overlayfs and tested by file_real() in VFS code.
+ * Not every "stacked" filesystem needs to set this flag.
+ */
+#define FMODE_STACKED	((__force fmode_t)0x40000000)
 
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
@@ -2450,6 +2456,27 @@ static inline struct file *file_clone_open(struct file *file)
 }
 extern int filp_close(struct file *, fl_owner_t id);
 
+/**
+ * file_real - Return the real file of a stacked file
+ * @file: The file to query
+ *
+ * If @file is on a union/overlay, then return the underlying, real file.
+ * Otherwise return @file.
+ */
+static inline struct file *file_real(struct file *file)
+{
+	/*
+	 * XXX: Instead of pretending we have a variaty of stacked filesystems
+	 * and implement a f_op->real() operation, just open code the simple
+	 * overlayfs implementation and if we ever need something more fancy,
+	 * we can add the f_op->real() then.
+	 */
+	if (unlikely(file->f_mode & FMODE_STACKED))
+		return (struct file *)file->private_data;
+	else
+		return file;
+}
+
 extern struct filename *getname_flags(const char __user *, int, int *);
 extern struct filename *getname(const char __user *);
 extern struct filename *getname_kernel(const char *);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag
  2018-08-26 16:25 [PATCH v2 0/6] Overlayfs stacked f_op fixes Amir Goldstein
  2018-08-26 16:25 ` [PATCH v2 1/6] vfs: add helper to get "real" overlayfs file Amir Goldstein
@ 2018-08-26 16:25 ` Amir Goldstein
  2018-08-26 19:26   ` Miklos Szeredi
  2018-08-27  3:38   ` Dave Chinner
  2018-08-26 16:25 ` [PATCH v2 3/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs Amir Goldstein
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 16:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Stacked overlayfs fiemap operation broke xfstests that test delayed
allocation (with "_test_generic_punch -d"), because ovl_fiemap()
failed to write dirty pages when requested.

Fixes: 9e142c4102db ("ovl: add ovl_fiemap()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/inode.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index e0bb217c01e2..5014749fd4b4 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -467,6 +467,10 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		return -EOPNOTSUPP;
 
 	old_cred = ovl_override_creds(inode->i_sb);
+
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC)
+		filemap_write_and_wait(realinode->i_mapping);
+
 	err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
 	revert_creds(old_cred);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 3/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  2018-08-26 16:25 [PATCH v2 0/6] Overlayfs stacked f_op fixes Amir Goldstein
  2018-08-26 16:25 ` [PATCH v2 1/6] vfs: add helper to get "real" overlayfs file Amir Goldstein
  2018-08-26 16:25 ` [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
@ 2018-08-26 16:25 ` Amir Goldstein
  2018-08-27  3:43   ` Dave Chinner
  2018-08-26 16:25 ` [PATCH v2 4/6] vfs: fix readahead syscall on an overlayfs file Amir Goldstein
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 16:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Since overlayfs implements stacked file operations, the underlying
filesystems are not supposed to be exposed to the overlayfs file,
whose f_inode is an overlayfs inode.

Assigning an overlayfs file to swap_file results in an attempt of xfs
code to dereference an xfs_inode struct from an ovl_inode pointer:

 CPU: 0 PID: 2462 Comm: swapon Not tainted
 4.18.0-xfstests-12721-g33e17876ea4e #3402
 RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
 Call Trace:
  xfs_iomap_swapfile_activate+0x1f/0x43
  __se_sys_swapon+0xb1a/0xee9

Fix this by not assigning the real inode mapping to f_mapping, which
will cause swapon() to return an error (-EINVAL). Although it makes
sense not to allow setting swpafile on an overlayfs file, some users
may depend on it, so we may need to fix this up in the future.

Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
open to fail. Fix this by installing ovl_aops with noop_direct_IO in
overlay inode mapping.

Keeping f_mapping pointing to overlay inode mapping will cause other
a_ops related operations to fail (e.g. readahead()). Those will be
fixed by follow up patches.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: f7c72396d0de ("ovl: add O_DIRECT support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/file.c  | 3 ---
 fs/overlayfs/inode.c | 6 ++++++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 24c9e0d70c3b..fd64daf6daa5 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -131,9 +131,6 @@ static int ovl_open(struct inode *inode, struct file *file)
 	if (IS_ERR(realfile))
 		return PTR_ERR(realfile);
 
-	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	file->f_mapping = realfile->f_mapping;
-
 	file->private_data = realfile;
 	file->f_mode |= FMODE_STACKED;
 
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 5014749fd4b4..b6ac545b5a32 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -504,6 +504,11 @@ static const struct inode_operations ovl_special_inode_operations = {
 	.update_time	= ovl_update_time,
 };
 
+const struct address_space_operations ovl_aops = {
+	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
+	.direct_IO		= noop_direct_IO,
+};
+
 /*
  * It is possible to stack overlayfs instance on top of another
  * overlayfs instance as lower layer. We need to annonate the
@@ -575,6 +580,7 @@ static void ovl_fill_inode(struct inode *inode, umode_t mode, dev_t rdev,
 	case S_IFREG:
 		inode->i_op = &ovl_file_inode_operations;
 		inode->i_fop = &ovl_file_operations;
+		inode->i_mapping->a_ops = &ovl_aops;
 		break;
 
 	case S_IFDIR:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 4/6] vfs: fix readahead syscall on an overlayfs file
  2018-08-26 16:25 [PATCH v2 0/6] Overlayfs stacked f_op fixes Amir Goldstein
                   ` (2 preceding siblings ...)
  2018-08-26 16:25 ` [PATCH v2 3/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs Amir Goldstein
@ 2018-08-26 16:25 ` Amir Goldstein
  2018-08-26 16:25 ` [PATCH v2 5/6] vfs: fix fadvise64 " Amir Goldstein
  2018-08-26 16:25 ` [PATCH v2 6/6] vfs: fix sync_file_range " Amir Goldstein
  5 siblings, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 16:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

Overlayfs implements stacked file operations, but does not implement
stacked a_ops, so passing an overlayfs file to do_readahead() results
in an error.

Fix this by passing the real underlying file to do_readahead().

Fixes: d1d04ef8572b ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 mm/readahead.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index a59ea70527b9..dc9c64ce6094 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -602,11 +602,16 @@ ssize_t ksys_readahead(int fd, loff_t offset, size_t count)
 	f = fdget(fd);
 	if (f.file) {
 		if (f.file->f_mode & FMODE_READ) {
-			struct address_space *mapping = f.file->f_mapping;
+			/*
+			 * XXX: We need to use file_real(), because overlayfs
+			 * stacked file/inode do not implement page io.
+			 */
+			struct file *file = file_real(f.file);
+			struct address_space *mapping = file->f_mapping;
 			pgoff_t start = offset >> PAGE_SHIFT;
 			pgoff_t end = (offset + count - 1) >> PAGE_SHIFT;
 			unsigned long len = end - start + 1;
-			ret = do_readahead(mapping, f.file, start, len);
+			ret = do_readahead(mapping, file, start, len);
 		}
 		fdput(f);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 5/6] vfs: fix fadvise64 syscall on an overlayfs file
  2018-08-26 16:25 [PATCH v2 0/6] Overlayfs stacked f_op fixes Amir Goldstein
                   ` (3 preceding siblings ...)
  2018-08-26 16:25 ` [PATCH v2 4/6] vfs: fix readahead syscall on an overlayfs file Amir Goldstein
@ 2018-08-26 16:25 ` Amir Goldstein
  2018-08-26 19:30   ` Miklos Szeredi
  2018-08-26 16:25 ` [PATCH v2 6/6] vfs: fix sync_file_range " Amir Goldstein
  5 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 16:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

For an overlayfs file/inode, gage io is operating on the real underlying
file, so the readahead hints set by fadvise64() should also be set on the
real underlying file to take affect.

Fixes: d1d04ef8572b ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 mm/fadvise.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index 2d8376e3c640..d5528343ce77 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -30,6 +30,7 @@
 int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 {
 	struct fd f = fdget(fd);
+	struct file *file;
 	struct inode *inode;
 	struct address_space *mapping;
 	struct backing_dev_info *bdi;
@@ -42,13 +43,18 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 	if (!f.file)
 		return -EBADF;
 
-	inode = file_inode(f.file);
+	/*
+	 * XXX: We need to use file_real() for overlayfs stacked file because
+	 * readahead will be operating on the real underlying file/inode.
+	 */
+	file = file_real(f.file);
+	inode = file_inode(file);
 	if (S_ISFIFO(inode->i_mode)) {
 		ret = -ESPIPE;
 		goto out;
 	}
 
-	mapping = f.file->f_mapping;
+	mapping = file->f_mapping;
 	if (!mapping || len < 0) {
 		ret = -EINVAL;
 		goto out;
@@ -85,21 +91,21 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 
 	switch (advice) {
 	case POSIX_FADV_NORMAL:
-		f.file->f_ra.ra_pages = bdi->ra_pages;
-		spin_lock(&f.file->f_lock);
-		f.file->f_mode &= ~FMODE_RANDOM;
-		spin_unlock(&f.file->f_lock);
+		file->f_ra.ra_pages = bdi->ra_pages;
+		spin_lock(&file->f_lock);
+		file->f_mode &= ~FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_RANDOM:
-		spin_lock(&f.file->f_lock);
-		f.file->f_mode |= FMODE_RANDOM;
-		spin_unlock(&f.file->f_lock);
+		spin_lock(&file->f_lock);
+		file->f_mode |= FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_SEQUENTIAL:
-		f.file->f_ra.ra_pages = bdi->ra_pages * 2;
-		spin_lock(&f.file->f_lock);
-		f.file->f_mode &= ~FMODE_RANDOM;
-		spin_unlock(&f.file->f_lock);
+		file->f_ra.ra_pages = bdi->ra_pages * 2;
+		spin_lock(&file->f_lock);
+		file->f_mode &= ~FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_WILLNEED:
 		/* First and last PARTIAL page! */
@@ -115,7 +121,7 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 		 * Ignore return value because fadvise() shall return
 		 * success even if filesystem can't retrieve a hint,
 		 */
-		force_page_cache_readahead(mapping, f.file, start_index,
+		force_page_cache_readahead(mapping, file, start_index,
 					   nrpages);
 		break;
 	case POSIX_FADV_NOREUSE:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 6/6] vfs: fix sync_file_range syscall on an overlayfs file
  2018-08-26 16:25 [PATCH v2 0/6] Overlayfs stacked f_op fixes Amir Goldstein
                   ` (4 preceding siblings ...)
  2018-08-26 16:25 ` [PATCH v2 5/6] vfs: fix fadvise64 " Amir Goldstein
@ 2018-08-26 16:25 ` Amir Goldstein
  2018-08-26 19:34   ` Miklos Szeredi
  5 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 16:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, linux-unionfs, linux-fsdevel

For an overlayfs file/inode, page io is operating on the real underlying
file, so sync_file_range() should operate on the real underlying file
mapping to take affect.

Fixes: d1d04ef8572b ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/sync.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index b54e0541ad89..28a26333844d 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -286,6 +286,7 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
 {
 	int ret;
 	struct fd f;
+	struct file *file;
 	struct address_space *mapping;
 	loff_t endbyte;			/* inclusive */
 	umode_t i_mode;
@@ -330,16 +331,21 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
 	if (!f.file)
 		goto out;
 
-	i_mode = file_inode(f.file)->i_mode;
+	/*
+	 * XXX: We need to use file_real() for overlayfs stacked file because
+	 * page io is operating on the real underlying file/inode.
+	 */
+	file = file_real(f.file);
+	i_mode = file_inode(file)->i_mode;
 	ret = -ESPIPE;
 	if (!S_ISREG(i_mode) && !S_ISBLK(i_mode) && !S_ISDIR(i_mode) &&
 			!S_ISLNK(i_mode))
 		goto out_put;
 
-	mapping = f.file->f_mapping;
+	mapping = file->f_mapping;
 	ret = 0;
 	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
-		ret = file_fdatawait_range(f.file, offset, endbyte);
+		ret = file_fdatawait_range(file, offset, endbyte);
 		if (ret < 0)
 			goto out_put;
 	}
@@ -352,7 +358,7 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
 	}
 
 	if (flags & SYNC_FILE_RANGE_WAIT_AFTER)
-		ret = file_fdatawait_range(f.file, offset, endbyte);
+		ret = file_fdatawait_range(file, offset, endbyte);
 
 out_put:
 	fdput(f);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag
  2018-08-26 16:25 ` [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
@ 2018-08-26 19:26   ` Miklos Szeredi
  2018-08-27  3:38   ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Miklos Szeredi @ 2018-08-26 19:26 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Al Viro, Dave Chinner, overlayfs, linux-fsdevel

On Sun, Aug 26, 2018 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> Stacked overlayfs fiemap operation broke xfstests that test delayed
> allocation (with "_test_generic_punch -d"), because ovl_fiemap()
> failed to write dirty pages when requested.

Makes sense.

Thanks,
Miklos

>
> Fixes: 9e142c4102db ("ovl: add ovl_fiemap()")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/overlayfs/inode.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index e0bb217c01e2..5014749fd4b4 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -467,6 +467,10 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>                 return -EOPNOTSUPP;
>
>         old_cred = ovl_override_creds(inode->i_sb);
> +
> +       if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC)
> +               filemap_write_and_wait(realinode->i_mapping);
> +
>         err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
>         revert_creds(old_cred);
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/6] vfs: fix fadvise64 syscall on an overlayfs file
  2018-08-26 16:25 ` [PATCH v2 5/6] vfs: fix fadvise64 " Amir Goldstein
@ 2018-08-26 19:30   ` Miklos Szeredi
  2018-08-26 21:23     ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Miklos Szeredi @ 2018-08-26 19:30 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Al Viro, Dave Chinner, overlayfs, linux-fsdevel

On Sun, Aug 26, 2018 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> For an overlayfs file/inode, gage io is operating on the real underlying
> file, so the readahead hints set by fadvise64() should also be set on the
> real underlying file to take affect.

Hmm, how about making this be an f_op?

Would also fix readahead(2), which can be translated to fadvise(...,
POSIX_FADV_WILLNEED).

And possibly be useful for other filesystems, since there are cases
when things are not done through a_ops, yet the filesystem would
benefit from knowing the intentions of the user.

Thanks,
Miklos

>
> Fixes: d1d04ef8572b ("ovl: stack file ops")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  mm/fadvise.c | 34 ++++++++++++++++++++--------------
>  1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 2d8376e3c640..d5528343ce77 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -30,6 +30,7 @@
>  int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>  {
>         struct fd f = fdget(fd);
> +       struct file *file;
>         struct inode *inode;
>         struct address_space *mapping;
>         struct backing_dev_info *bdi;
> @@ -42,13 +43,18 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>         if (!f.file)
>                 return -EBADF;
>
> -       inode = file_inode(f.file);
> +       /*
> +        * XXX: We need to use file_real() for overlayfs stacked file because
> +        * readahead will be operating on the real underlying file/inode.
> +        */
> +       file = file_real(f.file);
> +       inode = file_inode(file);
>         if (S_ISFIFO(inode->i_mode)) {
>                 ret = -ESPIPE;
>                 goto out;
>         }
>
> -       mapping = f.file->f_mapping;
> +       mapping = file->f_mapping;
>         if (!mapping || len < 0) {
>                 ret = -EINVAL;
>                 goto out;
> @@ -85,21 +91,21 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>
>         switch (advice) {
>         case POSIX_FADV_NORMAL:
> -               f.file->f_ra.ra_pages = bdi->ra_pages;
> -               spin_lock(&f.file->f_lock);
> -               f.file->f_mode &= ~FMODE_RANDOM;
> -               spin_unlock(&f.file->f_lock);
> +               file->f_ra.ra_pages = bdi->ra_pages;
> +               spin_lock(&file->f_lock);
> +               file->f_mode &= ~FMODE_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_RANDOM:
> -               spin_lock(&f.file->f_lock);
> -               f.file->f_mode |= FMODE_RANDOM;
> -               spin_unlock(&f.file->f_lock);
> +               spin_lock(&file->f_lock);
> +               file->f_mode |= FMODE_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_SEQUENTIAL:
> -               f.file->f_ra.ra_pages = bdi->ra_pages * 2;
> -               spin_lock(&f.file->f_lock);
> -               f.file->f_mode &= ~FMODE_RANDOM;
> -               spin_unlock(&f.file->f_lock);
> +               file->f_ra.ra_pages = bdi->ra_pages * 2;
> +               spin_lock(&file->f_lock);
> +               file->f_mode &= ~FMODE_RANDOM;
> +               spin_unlock(&file->f_lock);
>                 break;
>         case POSIX_FADV_WILLNEED:
>                 /* First and last PARTIAL page! */
> @@ -115,7 +121,7 @@ int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
>                  * Ignore return value because fadvise() shall return
>                  * success even if filesystem can't retrieve a hint,
>                  */
> -               force_page_cache_readahead(mapping, f.file, start_index,
> +               force_page_cache_readahead(mapping, file, start_index,
>                                            nrpages);
>                 break;
>         case POSIX_FADV_NOREUSE:
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/6] vfs: fix sync_file_range syscall on an overlayfs file
  2018-08-26 16:25 ` [PATCH v2 6/6] vfs: fix sync_file_range " Amir Goldstein
@ 2018-08-26 19:34   ` Miklos Szeredi
  2018-08-26 21:55     ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Miklos Szeredi @ 2018-08-26 19:34 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Al Viro, Dave Chinner, overlayfs, linux-fsdevel

On Sun, Aug 26, 2018 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> For an overlayfs file/inode, page io is operating on the real underlying
> file, so sync_file_range() should operate on the real underlying file
> mapping to take affect.

The man page tells us that this syscall basically gives no guarantees
at all and shouldn't be used in portable programs.

So, I'd just let the non-functionality be for now.   If someone
complains of a regression (unlikely) we can look into it.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/6] vfs: fix fadvise64 syscall on an overlayfs file
  2018-08-26 19:30   ` Miklos Szeredi
@ 2018-08-26 21:23     ` Amir Goldstein
  0 siblings, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 21:23 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, overlayfs, linux-fsdevel

On Sun, Aug 26, 2018 at 10:30 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Sun, Aug 26, 2018 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > For an overlayfs file/inode, gage io is operating on the real underlying
> > file, so the readahead hints set by fadvise64() should also be set on the
> > real underlying file to take affect.
>
> Hmm, how about making this be an f_op?
>
> Would also fix readahead(2), which can be translated to fadvise(...,
> POSIX_FADV_WILLNEED).
>

Ok. I guess we could match readahead(2) semantics to an advise
for fs that implements f_op->advise(). Semantics are not currently
the same for the case of !mapping->a_ops->readpages, but they
are already the same (i.e. both return 0) for dax_mapping(mapping).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/6] vfs: fix sync_file_range syscall on an overlayfs file
  2018-08-26 19:34   ` Miklos Szeredi
@ 2018-08-26 21:55     ` Amir Goldstein
  2018-08-27  4:23       ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-08-26 21:55 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Dave Chinner, overlayfs, linux-fsdevel

On Sun, Aug 26, 2018 at 10:34 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Sun, Aug 26, 2018 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > For an overlayfs file/inode, page io is operating on the real underlying
> > file, so sync_file_range() should operate on the real underlying file
> > mapping to take affect.
>
> The man page tells us that this syscall basically gives no guarantees
> at all and shouldn't be used in portable programs.
>

Oh no. You need to understand the context of this very bold warning.
The warning speaks lengthy about durability and it rightfully states that
you have no way of knowing what data will persist after crash.
This is relevant for application developers looking for durability, but that is
not the only use case for sync_file_range().

I have an application using sync_file_range() for consistency, which is not
the same game as durability.

They will tell you that the only safe way to guaranty consistency of data in a
new file is to do:
open(...O_TMPFILE) or open(TEMPFILE, ...)
write()
fsync()
link() or rename()

Then you don't know if file will exist after crash, but if it will
exist its content
will be consistent.

But the fact is that if you need to do many of those new file writes,
many fsync()
calls cost much more than the cost of syncing the inode pages, because every
new file writes metadata and metadata forces fsync to flush the journal.

Amplify that times number of containers and you have every fsync() on every
file in every overlayfs container all slamming of the underlying fs journal.

The fsync() in the snippet above can be safely replaced with sync_file_range()
eliminating all cost of excessive journal flushes without loosing any
consistency
guaranty on "strictly ordered metadata" filesystems - and all major filesystems
today are.

> So, I'd just let the non-functionality be for now.   If someone
> complains of a regression (unlikely) we can look into it.
>

I would like to place a complaint :-)

I guess we could go for f_op->sync_ranges()?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag
  2018-08-26 16:25 ` [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
  2018-08-26 19:26   ` Miklos Szeredi
@ 2018-08-27  3:38   ` Dave Chinner
  2018-08-27  6:20     ` Amir Goldstein
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-08-27  3:38 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Al Viro, linux-unionfs, linux-fsdevel

On Sun, Aug 26, 2018 at 07:25:13PM +0300, Amir Goldstein wrote:
> Stacked overlayfs fiemap operation broke xfstests that test delayed
> allocation (with "_test_generic_punch -d"), because ovl_fiemap()
> failed to write dirty pages when requested.
> 
> Fixes: 9e142c4102db ("ovl: add ovl_fiemap()")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/overlayfs/inode.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index e0bb217c01e2..5014749fd4b4 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -467,6 +467,10 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		return -EOPNOTSUPP;
>  
>  	old_cred = ovl_override_creds(inode->i_sb);
> +
> +	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC)
> +		filemap_write_and_wait(realinode->i_mapping);
> +
>  	err = realinode->i_op->fiemap(realinode, fieinfo, start, len);


Where's the fiemap_check_flags() call in the overlay code to
indicate to userspace what functionality ovl supports?

And, further, you can't take action on FIEMAP_FLAG_SYNC for the
lower filesystem file because the lower filesystem first has to
validate the fiemap flags passed in.

So if you need to process FIEMAP_FLAG_SYNC here for the lower
filesystem, that implies that there is a bug in the filesystem
implementations and/or the VFS fiemap behaviour.

e.g. in XFS we call iomap_fiemap(), and it does:

        ret = fiemap_check_flags(fi, FIEMAP_FLAG_SYNC);
        if (ret)
                return ret;

        if (fi->fi_flags & FIEMAP_FLAG_SYNC) {
                ret = filemap_write_and_wait(inode->i_mapping);
                if (ret)
                        return ret;
        }

That means you wouldn't have seen this bug on XFS. Ext4 does not do
this, so it would appear not to observe the FIEMAP_FLAG_SYNC
behaviour as it was asked to perform.

Ah, I see - the problem is ioctl_fiemap() - it assumes that it can
run the flush without first allowing the filesystem to check if that
flag is supported.

So, shouldn't the correct fix be to move the FIEMAP_FLAG_SYNC from
the VFS down into the filesystem implementations after they have
checked the flags field for supported functionality? That way ovl
doesn't need special case hacks to replicate VFS behaviour...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  2018-08-26 16:25 ` [PATCH v2 3/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs Amir Goldstein
@ 2018-08-27  3:43   ` Dave Chinner
  2018-08-27  6:34     ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-08-27  3:43 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Al Viro, linux-unionfs, linux-fsdevel

On Sun, Aug 26, 2018 at 07:25:14PM +0300, Amir Goldstein wrote:
> Since overlayfs implements stacked file operations, the underlying
> filesystems are not supposed to be exposed to the overlayfs file,
> whose f_inode is an overlayfs inode.
> 
> Assigning an overlayfs file to swap_file results in an attempt of xfs
> code to dereference an xfs_inode struct from an ovl_inode pointer:
> 
>  CPU: 0 PID: 2462 Comm: swapon Not tainted
>  4.18.0-xfstests-12721-g33e17876ea4e #3402
>  RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
>  Call Trace:
>   xfs_iomap_swapfile_activate+0x1f/0x43
>   __se_sys_swapon+0xb1a/0xee9
> 
> Fix this by not assigning the real inode mapping to f_mapping, which
> will cause swapon() to return an error (-EINVAL). Although it makes
> sense not to allow setting swpafile on an overlayfs file, some users
> may depend on it, so we may need to fix this up in the future.
> 
> Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
> open to fail. Fix this by installing ovl_aops with noop_direct_IO in
> overlay inode mapping.

Ummm - shouldn't ovl be checking the real inode's .direct_IO method
and returning status based on that? i.e. if the underlying fs
doesn't support O_DIRECT, neither should ovl...

> +const struct address_space_operations ovl_aops = {
> +	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
> +	.direct_IO		= noop_direct_IO,
> +};
> +
>  /*
>   * It is possible to stack overlayfs instance on top of another
>   * overlayfs instance as lower layer. We need to annonate the
> @@ -575,6 +580,7 @@ static void ovl_fill_inode(struct inode *inode, umode_t mode, dev_t rdev,
>  	case S_IFREG:
>  		inode->i_op = &ovl_file_inode_operations;
>  		inode->i_fop = &ovl_file_operations;
> +		inode->i_mapping->a_ops = &ovl_aops;
>  		break;

So you put an ovl interposer in the way here - it needs to pass
through *everything* to the the real inode's aops, right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/6] vfs: fix sync_file_range syscall on an overlayfs file
  2018-08-26 21:55     ` Amir Goldstein
@ 2018-08-27  4:23       ` Dave Chinner
  2018-08-27  6:37         ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-08-27  4:23 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Al Viro, overlayfs, linux-fsdevel

On Mon, Aug 27, 2018 at 12:55:36AM +0300, Amir Goldstein wrote:
> On Sun, Aug 26, 2018 at 10:34 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Sun, Aug 26, 2018 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > > For an overlayfs file/inode, page io is operating on the real underlying
> > > file, so sync_file_range() should operate on the real underlying file
> > > mapping to take affect.
> >
> > The man page tells us that this syscall basically gives no guarantees
> > at all and shouldn't be used in portable programs.
> >
> 
> Oh no. You need to understand the context of this very bold warning.
> The warning speaks lengthy about durability and it rightfully states that
> you have no way of knowing what data will persist after crash.
> This is relevant for application developers looking for durability, but that is
> not the only use case for sync_file_range().
> 
> I have an application using sync_file_range() for consistency, which is not
> the same game as durability.
> 
> They will tell you that the only safe way to guaranty consistency of data in a
> new file is to do:
> open(...O_TMPFILE) or open(TEMPFILE, ...)
> write()
> fsync()
> link() or rename()
> 
> Then you don't know if file will exist after crash, but if it will
> exist its content
> will be consistent.
> 
> But the fact is that if you need to do many of those new file writes,
> many fsync()
> calls cost much more than the cost of syncing the inode pages, because every
> new file writes metadata and metadata forces fsync to flush the journal.
> 
> Amplify that times number of containers and you have every fsync() on every
> file in every overlayfs container all slamming of the underlying fs journal.
> 
> The fsync() in the snippet above can be safely replaced with sync_file_range()
> eliminating all cost of excessive journal flushes without loosing any
> consistency
> guaranty on "strictly ordered metadata" filesystems - and all major filesystems
> today are.

Wrong.

Nice story, but wrong.

sync_file_range does this:

	if (flags & SYNC_FILE_RANGE_WRITE) {
		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                                                 WB_SYNC_NONE);
	......

Note the use of "WB_SYNC_NONE"?

This writeback type provides no guarantees that the entire range is
written back.  Writeback can skip pages for any reason when it is
set - to avoid blocking, lock contention, maybe complex allocation
is required, etc. WB_SYNC_NONE doesn't even tag pages in
write_cache_pages() so there's no way to ensure no pages are missed
or retried when set_page_writeback_keepwrite() is called due to
partial page writeback requiring another writeback call to the page
to finish writeback. It doesn't try to write back newly dirty
pages that are already under writeback. And so on.

sync_file_range() provides *no guarantees* about getting your data
to disk at all and /never has/.

> > So, I'd just let the non-functionality be for now.   If someone
> > complains of a regression (unlikely) we can look into it.
> 
> I would like to place a complaint :-)
> 
> I guess we could go for f_op->sync_ranges()?

No. sync_file_range() needs to die.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag
  2018-08-27  3:38   ` Dave Chinner
@ 2018-08-27  6:20     ` Amir Goldstein
  2018-08-27 23:05       ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-08-27  6:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Miklos Szeredi, Al Viro, overlayfs, linux-fsdevel

On Mon, Aug 27, 2018 at 6:38 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Aug 26, 2018 at 07:25:13PM +0300, Amir Goldstein wrote:
> > Stacked overlayfs fiemap operation broke xfstests that test delayed
> > allocation (with "_test_generic_punch -d"), because ovl_fiemap()
> > failed to write dirty pages when requested.
> >
> > Fixes: 9e142c4102db ("ovl: add ovl_fiemap()")
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  fs/overlayfs/inode.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> > index e0bb217c01e2..5014749fd4b4 100644
> > --- a/fs/overlayfs/inode.c
> > +++ b/fs/overlayfs/inode.c
> > @@ -467,6 +467,10 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> >               return -EOPNOTSUPP;
> >
> >       old_cred = ovl_override_creds(inode->i_sb);
> > +
> > +     if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC)
> > +             filemap_write_and_wait(realinode->i_mapping);
> > +
> >       err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
>
>
> Where's the fiemap_check_flags() call in the overlay code to
> indicate to userspace what functionality ovl supports?
>
> And, further, you can't take action on FIEMAP_FLAG_SYNC for the
> lower filesystem file because the lower filesystem first has to
> validate the fiemap flags passed in.
>

The is no law against speculative syncing filesystem file pages ;-)
Overlayfs will also fsync a file after first open for write (post copy up)
for obvious reasons.

> So if you need to process FIEMAP_FLAG_SYNC here for the lower
> filesystem, that implies that there is a bug in the filesystem
> implementations and/or the VFS fiemap behaviour.
>
> e.g. in XFS we call iomap_fiemap(), and it does:
>
>         ret = fiemap_check_flags(fi, FIEMAP_FLAG_SYNC);
>         if (ret)
>                 return ret;
>
>         if (fi->fi_flags & FIEMAP_FLAG_SYNC) {
>                 ret = filemap_write_and_wait(inode->i_mapping);
>                 if (ret)
>                         return ret;
>         }
>
> That means you wouldn't have seen this bug on XFS. Ext4 does not do
> this, so it would appear not to observe the FIEMAP_FLAG_SYNC
> behaviour as it was asked to perform.
>

True. overlay over xfs didn't fail those tests.

> Ah, I see - the problem is ioctl_fiemap() - it assumes that it can
> run the flush without first allowing the filesystem to check if that
> flag is supported.
>
> So, shouldn't the correct fix be to move the FIEMAP_FLAG_SYNC from
> the VFS down into the filesystem implementations after they have
> checked the flags field for supported functionality? That way ovl
> doesn't need special case hacks to replicate VFS behaviour...
>

IMO, one line of replicating VFS behavior is better than duplicating
code that is run 99% of the time from VFS into all fs implementations.
Question is whether syncing file pages can be considered harmfull
when issuing FIEMAP_FLAG_XATTR or FIEMAP_FLAG_CACHE?
It can't be considered DoS, because same user can call fsync().

But hey! I can re-write my story about sync_file_ranges() now,
with fiemap(FIEMAP_FLAG_CACHE) can't I? ;-)

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  2018-08-27  3:43   ` Dave Chinner
@ 2018-08-27  6:34     ` Amir Goldstein
  2018-08-27  9:49       ` Miklos Szeredi
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-08-27  6:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Miklos Szeredi, Al Viro, overlayfs, linux-fsdevel

On Mon, Aug 27, 2018 at 6:43 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Aug 26, 2018 at 07:25:14PM +0300, Amir Goldstein wrote:
> > Since overlayfs implements stacked file operations, the underlying
> > filesystems are not supposed to be exposed to the overlayfs file,
> > whose f_inode is an overlayfs inode.
> >
> > Assigning an overlayfs file to swap_file results in an attempt of xfs
> > code to dereference an xfs_inode struct from an ovl_inode pointer:
> >
> >  CPU: 0 PID: 2462 Comm: swapon Not tainted
> >  4.18.0-xfstests-12721-g33e17876ea4e #3402
> >  RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
> >  Call Trace:
> >   xfs_iomap_swapfile_activate+0x1f/0x43
> >   __se_sys_swapon+0xb1a/0xee9
> >
> > Fix this by not assigning the real inode mapping to f_mapping, which
> > will cause swapon() to return an error (-EINVAL). Although it makes
> > sense not to allow setting swpafile on an overlayfs file, some users
> > may depend on it, so we may need to fix this up in the future.
> >
> > Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
> > open to fail. Fix this by installing ovl_aops with noop_direct_IO in
> > overlay inode mapping.
>
> Ummm - shouldn't ovl be checking the real inode's .direct_IO method
> and returning status based on that? i.e. if the underlying fs
> doesn't support O_DIRECT, neither should ovl...
>

ovl_open_realfile() will take care of that later when overlay actually
tried to open the underlying file.

> > +const struct address_space_operations ovl_aops = {
> > +     /* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
> > +     .direct_IO              = noop_direct_IO,
> > +};
> > +
> >  /*
> >   * It is possible to stack overlayfs instance on top of another
> >   * overlayfs instance as lower layer. We need to annonate the
> > @@ -575,6 +580,7 @@ static void ovl_fill_inode(struct inode *inode, umode_t mode, dev_t rdev,
> >       case S_IFREG:
> >               inode->i_op = &ovl_file_inode_operations;
> >               inode->i_fop = &ovl_file_operations;
> > +             inode->i_mapping->a_ops = &ovl_aops;
> >               break;
>
> So you put an ovl interposer in the way here - it needs to pass
> through *everything* to the the real inode's aops, right?
>

No. it's just a decoy a_ops, so if we miss another spot like swapon()
user will get an error (i.e. for no a_ops->readpages) rather then calling
into underlying filesystem aops with an overlay file and oopsing.

So the trick of this patch is to change the game from whack-an-oops
to whack-an-einval, which is better for everyone.

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/6] vfs: fix sync_file_range syscall on an overlayfs file
  2018-08-27  4:23       ` Dave Chinner
@ 2018-08-27  6:37         ` Amir Goldstein
  0 siblings, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-08-27  6:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Miklos Szeredi, Al Viro, overlayfs, linux-fsdevel

On Mon, Aug 27, 2018 at 7:25 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Aug 27, 2018 at 12:55:36AM +0300, Amir Goldstein wrote:
> > On Sun, Aug 26, 2018 at 10:34 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Sun, Aug 26, 2018 at 6:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> > > > For an overlayfs file/inode, page io is operating on the real underlying
> > > > file, so sync_file_range() should operate on the real underlying file
> > > > mapping to take affect.
> > >
> > > The man page tells us that this syscall basically gives no guarantees
> > > at all and shouldn't be used in portable programs.
> > >
> >
> > Oh no. You need to understand the context of this very bold warning.
> > The warning speaks lengthy about durability and it rightfully states that
> > you have no way of knowing what data will persist after crash.
> > This is relevant for application developers looking for durability, but that is
> > not the only use case for sync_file_range().
> >
> > I have an application using sync_file_range() for consistency, which is not
> > the same game as durability.
> >
> > They will tell you that the only safe way to guaranty consistency of data in a
> > new file is to do:
> > open(...O_TMPFILE) or open(TEMPFILE, ...)
> > write()
> > fsync()
> > link() or rename()
> >
> > Then you don't know if file will exist after crash, but if it will
> > exist its content
> > will be consistent.
> >
> > But the fact is that if you need to do many of those new file writes,
> > many fsync()
> > calls cost much more than the cost of syncing the inode pages, because every
> > new file writes metadata and metadata forces fsync to flush the journal.
> >
> > Amplify that times number of containers and you have every fsync() on every
> > file in every overlayfs container all slamming of the underlying fs journal.
> >
> > The fsync() in the snippet above can be safely replaced with sync_file_range()
> > eliminating all cost of excessive journal flushes without loosing any
> > consistency
> > guaranty on "strictly ordered metadata" filesystems - and all major filesystems
> > today are.
>
> Wrong.
>
> Nice story, but wrong.
>
> sync_file_range does this:
>
>         if (flags & SYNC_FILE_RANGE_WRITE) {
>                 ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
>                                                  WB_SYNC_NONE);
>         ......
>
> Note the use of "WB_SYNC_NONE"?
>
> This writeback type provides no guarantees that the entire range is
> written back.  Writeback can skip pages for any reason when it is
> set - to avoid blocking, lock contention, maybe complex allocation
> is required, etc. WB_SYNC_NONE doesn't even tag pages in
> write_cache_pages() so there's no way to ensure no pages are missed
> or retried when set_page_writeback_keepwrite() is called due to
> partial page writeback requiring another writeback call to the page
> to finish writeback. It doesn't try to write back newly dirty
> pages that are already under writeback. And so on.
>
> sync_file_range() provides *no guarantees* about getting your data
> to disk at all and /never has/.
>

Thanks for clarifying that!
I guess we'll need to go and re-fix concurrent _xfs_log_force()
optimization ;-/

> > > So, I'd just let the non-functionality be for now.   If someone
> > > complains of a regression (unlikely) we can look into it.
> >
> > I would like to place a complaint :-)
> >
> > I guess we could go for f_op->sync_ranges()?
>
> No. sync_file_range() needs to die.
>

I guess if we really wanted we could add a new FADV_WILLSYNC...
Anyway, I am withdrawing the complaint.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  2018-08-27  6:34     ` Amir Goldstein
@ 2018-08-27  9:49       ` Miklos Szeredi
  0 siblings, 0 replies; 20+ messages in thread
From: Miklos Szeredi @ 2018-08-27  9:49 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, Al Viro, overlayfs, linux-fsdevel

On Mon, Aug 27, 2018 at 8:34 AM, Amir Goldstein <amir73il@gmail.com> wrote:

> So the trick of this patch is to change the game from whack-an-oops
> to whack-an-einval, which is better for everyone.

In other words: this overlayfs update might cause regressions in dark
and musty corners of overlayfs usage, but those will not be Oops
inducing regressions.  And whenever someone reports a regression, that
will be solved by making that corner of the kernel less musty and dark
(i.e. properly stackable), so in the end this is a win-win game.

There's one corner case remaining, which is shared mmap consistency
across copy-up.  I think the solution to this will be:

 - private mappings of lower files will directly go to the lower
file's mapping (just like now)
 - anything else (shared mmap of lower files and all mmap of upper
files) will go to the overlay mapping (just like a normal fs)

The flip side is cache duplication between shared mmap and private
mmap of the same lower file.  I hope this would be pretty rare.

I don't see big complexities with the above, it should be solvable
purely in overlayfs.

Then there's getting rid of d_real() for good, for which Al has some
ideas, and we are done.  So yes, I think the end of the tunnel is
pretty well in sight.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag
  2018-08-27  6:20     ` Amir Goldstein
@ 2018-08-27 23:05       ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2018-08-27 23:05 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Al Viro, overlayfs, linux-fsdevel

On Mon, Aug 27, 2018 at 09:20:32AM +0300, Amir Goldstein wrote:
> On Mon, Aug 27, 2018 at 6:38 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Sun, Aug 26, 2018 at 07:25:13PM +0300, Amir Goldstein wrote:
> > > Stacked overlayfs fiemap operation broke xfstests that test delayed
> > > allocation (with "_test_generic_punch -d"), because ovl_fiemap()
> > > failed to write dirty pages when requested.
> > >
> > > Fixes: 9e142c4102db ("ovl: add ovl_fiemap()")
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >  fs/overlayfs/inode.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> > > index e0bb217c01e2..5014749fd4b4 100644
> > > --- a/fs/overlayfs/inode.c
> > > +++ b/fs/overlayfs/inode.c
> > > @@ -467,6 +467,10 @@ static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> > >               return -EOPNOTSUPP;
> > >
> > >       old_cred = ovl_override_creds(inode->i_sb);
> > > +
> > > +     if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC)
> > > +             filemap_write_and_wait(realinode->i_mapping);
> > > +
> > >       err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
> >
> >
> > Where's the fiemap_check_flags() call in the overlay code to
> > indicate to userspace what functionality ovl supports?
> >
> > And, further, you can't take action on FIEMAP_FLAG_SYNC for the
> > lower filesystem file because the lower filesystem first has to
> > validate the fiemap flags passed in.
> >
> 
> The is no law against speculative syncing filesystem file pages ;-)

No, but it can cause all sorts of performance problems for users.

> Overlayfs will also fsync a file after first open for write (post copy up)
> for obvious reasons.
> 
> > So if you need to process FIEMAP_FLAG_SYNC here for the lower
> > filesystem, that implies that there is a bug in the filesystem
> > implementations and/or the VFS fiemap behaviour.
> >
> > e.g. in XFS we call iomap_fiemap(), and it does:
> >
> >         ret = fiemap_check_flags(fi, FIEMAP_FLAG_SYNC);
> >         if (ret)
> >                 return ret;
> >
> >         if (fi->fi_flags & FIEMAP_FLAG_SYNC) {
> >                 ret = filemap_write_and_wait(inode->i_mapping);
> >                 if (ret)
> >                         return ret;
> >         }
> >
> > That means you wouldn't have seen this bug on XFS. Ext4 does not do
> > this, so it would appear not to observe the FIEMAP_FLAG_SYNC
> > behaviour as it was asked to perform.
> 
> True. overlay over xfs didn't fail those tests.
> 
> > Ah, I see - the problem is ioctl_fiemap() - it assumes that it can
> > run the flush without first allowing the filesystem to check if that
> > flag is supported.
> >
> > So, shouldn't the correct fix be to move the FIEMAP_FLAG_SYNC from
> > the VFS down into the filesystem implementations after they have
> > checked the flags field for supported functionality? That way ovl
> > doesn't need special case hacks to replicate VFS behaviour...
> >
> 
> IMO, one line of replicating VFS behavior is better than duplicating
> code that is run 99% of the time from VFS into all fs implementations.

Except when it violates the documented interface behaviour.

That is: filesystems can choose to reject FIEMAP_FLAG_SYNC with
EBADR when it is set in the request (same as they can reject
FIEMAP_FLAG_XATTR), and that means issuing it unconditionally in any
fiemap code without first checking that it is supported is a *bug*.

FWIW, in looking at this  I note that fiemap grew new flags without
adding them to the COMPAT mask, nor did the filesystems check that
they are valid flags. i.e., the FIEMAP_FLAGS_COMPAT mechanism was
subverted by the addition of FIEMAP_FLAG_CACHE - ext4 uses the flag
and returns before it does it's fiemap_check_flags() call to check
for supported flags because that check would fail if
FIEMAP_FLAG_CACHE is set...

<sigh>

Bugs everywhere. How does any of this shit even work?

> Question is whether syncing file pages can be considered harmfull
> when issuing FIEMAP_FLAG_XATTR or FIEMAP_FLAG_CACHE?

If observing the current state of the system can screw up system
performance (like unnecessarily issuing bulk writeback) then it can
be considered harmful.

> It can't be considered DoS, because same user can call fsync().

It doesn't have to be a DoS to be harmful. Users don't tend to walk
their filesystem and issue fsync on every file (we have sync(1) for
that) but maintenance utilities do walk and map extents for every
file...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-08-28  2:54 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-26 16:25 [PATCH v2 0/6] Overlayfs stacked f_op fixes Amir Goldstein
2018-08-26 16:25 ` [PATCH v2 1/6] vfs: add helper to get "real" overlayfs file Amir Goldstein
2018-08-26 16:25 ` [PATCH v2 2/6] ovl: respect FIEMAP_FLAG_SYNC flag Amir Goldstein
2018-08-26 19:26   ` Miklos Szeredi
2018-08-27  3:38   ` Dave Chinner
2018-08-27  6:20     ` Amir Goldstein
2018-08-27 23:05       ` Dave Chinner
2018-08-26 16:25 ` [PATCH v2 3/6] ovl: fix GPF in swapfile_activate of file from overlayfs over xfs Amir Goldstein
2018-08-27  3:43   ` Dave Chinner
2018-08-27  6:34     ` Amir Goldstein
2018-08-27  9:49       ` Miklos Szeredi
2018-08-26 16:25 ` [PATCH v2 4/6] vfs: fix readahead syscall on an overlayfs file Amir Goldstein
2018-08-26 16:25 ` [PATCH v2 5/6] vfs: fix fadvise64 " Amir Goldstein
2018-08-26 19:30   ` Miklos Szeredi
2018-08-26 21:23     ` Amir Goldstein
2018-08-26 16:25 ` [PATCH v2 6/6] vfs: fix sync_file_range " Amir Goldstein
2018-08-26 19:34   ` Miklos Szeredi
2018-08-26 21:55     ` Amir Goldstein
2018-08-27  4:23       ` Dave Chinner
2018-08-27  6:37         ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).