All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
@ 2019-01-22 12:34 Amir Goldstein
  2019-01-22 12:34 ` [RFC][PATCH v2 1/5] ovl: reorder tests in ovl_open_need_copy_up() Amir Goldstein
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Amir Goldstein @ 2019-01-22 12:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, linux-unionfs

Miklos,

This experiment uses very naiive noop aops on overlay inode
for upper data files (i.e. SHARED_UPPER and UPPER).

The patches are available on my github [1] and are based on a bug fix
I posted earlier today ("ovl: fix missing upper fs freeze protection on
copy up for ioctl").

Despite the noop aops, the experiment can demonstrates expected behavior
w.r.t the transition from LOWER -> SHARED_UPPER.

With metacopy=on, file is meta copied up on open O_RDWR and data is
copied up on first data access (not on first write).
This solves the open() latency issue reported by Chengguang, but also
serves as a way to test copy up on page fault.
I havn't written any tests for lazy copy up on mwrite/mread, but I did
test these cases manually.

~50/60 of the overlay xfstests actually pass (those test don't care
about data or about committing writes to storage) and the rest fail on
expected data inconsistencies with no observed crashes nor warnings.

The issue discussed with Vivek on v1 patch of this experiment
(inefficient ovl_iter_read()) becomes moot in v2, because unlike v1,
data is copied up on first read or write of writable file and not on
first write. So there is no LOWER_SHARED state in these patches, but
the generic filemap operations and lazy copy up hooks bring us closer
to an implemention of the LOWER_SHARED -> UPPER_SHARED transition.

Can you please point me in the direction of how to implement proper
aops? Or better yet, send me a demo patch?
I figure we need to call vfs_iter_read/vfs_iter_write of real file
with IOCB_DIRECT for page io? But I am very uncertain about locking
order requirements and whether there is already some infrastructure
that we can use (some splice variant)?

Any other thoughts?

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/ovl-lazy-copyup-wip

Amir Goldstein (5):
  ovl: reorder tests in ovl_open_need_copy_up()
  ovl: prepare for generic filemap file operations
  ovl: lazy copy up of data on first data access
  ovl: lazy copy up data on page fault
  ovl: noop aops to test filemap operations and lazy copy up

 fs/overlayfs/copy_up.c   |  14 +--
 fs/overlayfs/file.c      | 238 +++++++++++++++++++++++++++++++++++++--
 fs/overlayfs/inode.c     |   5 -
 fs/overlayfs/overlayfs.h |  11 +-
 fs/overlayfs/util.c      |   4 +-
 mm/fadvise.c             |   4 +-
 6 files changed, 248 insertions(+), 28 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC][PATCH v2 1/5] ovl: reorder tests in ovl_open_need_copy_up()
  2019-01-22 12:34 [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
@ 2019-01-22 12:34 ` Amir Goldstein
  2019-01-22 12:34 ` [RFC][PATCH v2 2/5] ovl: prepare for generic filemap file operations Amir Goldstein
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Amir Goldstein @ 2019-01-22 12:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, linux-unionfs

ovl_already_copied_up() has two memory barriers, which could be
avoided for open for read, so move it after open flags test.

special_file() is not expected from ovl_open(), so WARN_ON it,
and remove stale comment about disconnected dentry.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/copy_up.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 0c0e11d529d0..05f853c7db0d 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -872,17 +872,14 @@ int ovl_copy_up_flags(struct dentry *dentry, int flags)
 
 static bool ovl_open_need_copy_up(struct dentry *dentry, int flags)
 {
-	/* Copy up of disconnected dentry does not set upper alias */
-	if (ovl_already_copied_up(dentry, flags))
-		return false;
-
-	if (special_file(d_inode(dentry)->i_mode))
+	/* Not called from regular file ops? */
+	if (WARN_ON(special_file(d_inode(dentry)->i_mode)))
 		return false;
 
 	if (!ovl_open_flags_need_copy_up(flags))
 		return false;
 
-	return true;
+	return !ovl_already_copied_up(dentry, flags);
 }
 
 int ovl_maybe_copy_up(struct dentry *dentry, int flags)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH v2 2/5] ovl: prepare for generic filemap file operations
  2019-01-22 12:34 [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
  2019-01-22 12:34 ` [RFC][PATCH v2 1/5] ovl: reorder tests in ovl_open_need_copy_up() Amir Goldstein
@ 2019-01-22 12:34 ` Amir Goldstein
  2019-01-22 12:34 ` [RFC][PATCH v2 3/5] ovl: lazy copy up of data on first data access Amir Goldstein
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Amir Goldstein @ 2019-01-22 12:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, linux-unionfs

Implement a variant of overlay file operations using generic
filemap helpers.  This implementation will be enabled for copied up
files after overlay address space operations are implemented.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/file.c      | 118 +++++++++++++++++++++++++++++++++++++--
 fs/overlayfs/inode.c     |   5 --
 fs/overlayfs/overlayfs.h |   1 +
 mm/fadvise.c             |   4 +-
 4 files changed, 115 insertions(+), 13 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 50e4407398d8..5f648979adbf 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -11,6 +11,11 @@
 #include <linux/mount.h>
 #include <linux/xattr.h>
 #include <linux/uio.h>
+#include <linux/mm.h>
+#include <linux/iomap.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
+#include <linux/ratelimit.h>
 #include "overlayfs.h"
 
 static char ovl_whatisit(struct inode *inode, struct inode *realinode)
@@ -114,6 +119,36 @@ static int ovl_real_fdget(const struct file *file, struct fd *real)
 	return ovl_real_fdget_meta(file, real, false);
 }
 
+static bool ovl_filemap_support(struct file *file)
+{
+	struct ovl_fs *ofs = file_inode(file)->i_sb->s_fs_info;
+
+	/* TODO: implement aops to upper inode data */
+	return ofs->upper_mnt && ovl_aops.writepage;
+}
+
+static bool ovl_should_use_filemap(struct file *file)
+{
+	if (!ovl_filemap_support(file))
+		return false;
+
+	/*
+	 * Use overlay inode page cache for all inodes that could be dirty,
+	 * including pure upper inodes, so ovl_sync_fs() can sync all dirty
+	 * overlay inodes without having to sync all upper fs dirty inodes.
+	 */
+	return ovl_has_upperdata(file_inode(file));
+}
+
+static int ovl_flush_filemap(struct file *file, loff_t offset, loff_t len)
+{
+	if (!ovl_should_use_filemap(file))
+		return 0;
+
+	return filemap_write_and_wait_range(file_inode(file)->i_mapping,
+					    offset, len);
+}
+
 static int ovl_open(struct inode *inode, struct file *file)
 {
 	struct file *realfile;
@@ -190,7 +225,7 @@ static rwf_t ovl_iocb_to_rwf(struct kiocb *iocb)
 	return flags;
 }
 
-static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+static ssize_t ovl_real_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
 	struct fd real;
@@ -216,7 +251,17 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
-static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
+static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *file = iocb->ki_filp;
+
+	if (ovl_should_use_filemap(file))
+		return generic_file_read_iter(iocb, iter);
+
+	return ovl_real_read_iter(iocb, iter);
+}
+
+static ssize_t ovl_real_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
@@ -256,7 +301,19 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
-static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *file = iocb->ki_filp;
+
+	if (ovl_should_use_filemap(file))
+		return generic_file_write_iter(iocb, iter);
+
+	return ovl_real_write_iter(iocb, iter);
+}
+
+
+static int ovl_real_fsync(struct file *file, loff_t start, loff_t end,
+			  int datasync)
 {
 	struct fd real;
 	const struct cred *old_cred;
@@ -278,7 +335,20 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	return ret;
 }
 
-static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
+static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+	if (ovl_should_use_filemap(file))
+		return __generic_file_fsync(file, start, end, datasync);
+
+	return ovl_real_fsync(file, start, end, datasync);
+}
+
+const struct address_space_operations ovl_aops = {
+	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
+	.direct_IO		= noop_direct_IO,
+};
+
+static int ovl_real_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct file *realfile = file->private_data;
 	const struct cred *old_cred;
@@ -309,13 +379,27 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 	return ret;
 }
 
-static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (ovl_should_use_filemap(file))
+		generic_file_mmap(file, vma);
+
+	return ovl_real_mmap(file, vma);
+}
+
+static long ovl_fallocate(struct file *file, int mode, loff_t offset,
+			  loff_t len)
 {
 	struct inode *inode = file_inode(file);
 	struct fd real;
 	const struct cred *old_cred;
 	int ret;
 
+	/* XXX: Different modes need to flush different ranges... */
+	ret = ovl_flush_filemap(file, 0, LLONG_MAX);
+	if (ret)
+		return ret;
+
 	ret = ovl_real_fdget(file, &real);
 	if (ret)
 		return ret;
@@ -332,7 +416,8 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len
 	return ret;
 }
 
-static int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+static int ovl_real_fadvise(struct file *file, loff_t offset, loff_t len,
+			    int advice)
 {
 	struct fd real;
 	const struct cred *old_cred;
@@ -351,6 +436,18 @@ static int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 	return ret;
 }
 
+extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
+			   int advice);
+
+static int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+	if (ovl_should_use_filemap(file))
+		return generic_fadvise(file, offset, len, advice);
+
+	/* XXX: Should we allow messing with lower shared page cache? */
+	return ovl_real_fadvise(file, offset, len, advice);
+}
+
 static long ovl_real_ioctl(struct file *file, unsigned int cmd,
 			   unsigned long arg)
 {
@@ -442,6 +539,15 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 	const struct cred *old_cred;
 	loff_t ret;
 
+	/* XXX: For some ops zero length means EOF... */
+	ret = ovl_flush_filemap(file_out, pos_out, len ?: LLONG_MAX);
+	if (ret)
+		return ret;
+
+	ret = ovl_flush_filemap(file_in, pos_in, len ?: LLONG_MAX);
+	if (ret)
+		return ret;
+
 	ret = ovl_real_fdget(file_out, &real_out);
 	if (ret)
 		return ret;
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 3b7ed5d2279c..526ff8d503b0 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -504,11 +504,6 @@ static const struct inode_operations ovl_special_inode_operations = {
 	.update_time	= ovl_update_time,
 };
 
-static const struct address_space_operations ovl_aops = {
-	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	.direct_IO		= noop_direct_IO,
-};
-
 /*
  * It is possible to stack overlayfs instance on top of another
  * overlayfs instance as lower layer. We need to annonate the
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 767eeaf73bf1..ab61c07e8583 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -414,6 +414,7 @@ struct dentry *ovl_create_temp(struct dentry *workdir, struct ovl_cattr *attr);
 
 /* file.c */
 extern const struct file_operations ovl_file_operations;
+extern const struct address_space_operations ovl_aops;
 
 /* copy_up.c */
 int ovl_copy_up(struct dentry *dentry);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 467bcd032037..4f17c83db575 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -27,8 +27,7 @@
  * deactivate the pages and clear PG_Referenced.
  */
 
-static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
-			   int advice)
+int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 {
 	struct inode *inode;
 	struct address_space *mapping;
@@ -178,6 +177,7 @@ static int generic_fadvise(struct file *file, loff_t offset, loff_t len,
 	}
 	return 0;
 }
+EXPORT_SYMBOL(generic_fadvise);
 
 int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH v2 3/5] ovl: lazy copy up of data on first data access
  2019-01-22 12:34 [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
  2019-01-22 12:34 ` [RFC][PATCH v2 1/5] ovl: reorder tests in ovl_open_need_copy_up() Amir Goldstein
  2019-01-22 12:34 ` [RFC][PATCH v2 2/5] ovl: prepare for generic filemap file operations Amir Goldstein
@ 2019-01-22 12:34 ` Amir Goldstein
  2019-01-22 12:34 ` [RFC][PATCH v2 4/5] ovl: lazy copy up data on page fault Amir Goldstein
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Amir Goldstein @ 2019-01-22 12:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, linux-unionfs

When metacopy feature is enabled and filemap operations supported,
copy up only metadata when opening a file O_RDWR and defer copy up
of data until the first data access file operation.

Copy up on first page fault will be handled by a followup patch.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/copy_up.c   |  5 +---
 fs/overlayfs/file.c      | 63 +++++++++++++++++++++++++++++++++++-----
 fs/overlayfs/overlayfs.h | 10 ++++++-
 fs/overlayfs/util.c      |  4 +--
 4 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 05f853c7db0d..92a27ba65c78 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -731,10 +731,7 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
 	if (!S_ISREG(mode))
 		return false;
 
-	if (flags && ((OPEN_FMODE(flags) & FMODE_WRITE) || (flags & O_TRUNC)))
-		return false;
-
-	return true;
+	return !ovl_open_flags_need_data_copy_up(flags);
 }
 
 /* Copy up data of an inode which was copied up metadata only in the past. */
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 5f648979adbf..e02b61af9488 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -90,10 +90,24 @@ static int ovl_real_fdget_meta(const struct file *file, struct fd *real,
 {
 	struct inode *inode = file_inode(file);
 	struct inode *realinode;
+	int err;
 
 	real->flags = 0;
 	real->file = file->private_data;
 
+	/*
+	 * Lazy copy up caches the meta copy upper file on open O_RDWR.
+	 * We need to promote upper inode to full data copy up before
+	 * we allow access to real file data on a writable file, otherwise
+	 * we may try to open a lower file O_RDWR or perform data operations
+	 * (e.g. fallocate) on the metacopy inode.
+	 */
+	if (!allow_meta) {
+		err = ovl_maybe_copy_up(file_dentry(file), file->f_flags);
+		if (err)
+			return err;
+	}
+
 	if (allow_meta)
 		realinode = ovl_inode_real(inode);
 	else
@@ -127,11 +141,34 @@ static bool ovl_filemap_support(struct file *file)
 	return ofs->upper_mnt && ovl_aops.writepage;
 }
 
-static bool ovl_should_use_filemap(struct file *file)
+static bool ovl_should_use_filemap_meta(struct file *file, bool meta_only)
 {
+	int err;
+
 	if (!ovl_filemap_support(file))
 		return false;
 
+	/*
+	 * If file was opened O_RDWR with lazy copy up of data, the first
+	 * data access file operation will trigger data copy up. If file was
+	 * opened O_RDWR and @meta_only is true, we use overlay inode filemap
+	 * operations, but defer data copy up further.
+	 *
+	 * For example, mmap() and fsync() are metadata only operations that
+	 * do not trigger lazy copy up of data, but read() (on a file open for
+	 * write) is a data access operation that does trigger data copy up.
+	 */
+	if (!meta_only) {
+		err = ovl_maybe_copy_up(file_dentry(file), file->f_flags);
+		if (err) {
+			pr_warn_ratelimited("overlayfs: failed lazy copy up data (%pd2, err=%i)\n",
+					    file_dentry(file), err);
+			return false;
+		}
+	} else if (file->f_mode & FMODE_WRITE) {
+		return true;
+	}
+
 	/*
 	 * Use overlay inode page cache for all inodes that could be dirty,
 	 * including pure upper inodes, so ovl_sync_fs() can sync all dirty
@@ -140,9 +177,14 @@ static bool ovl_should_use_filemap(struct file *file)
 	return ovl_has_upperdata(file_inode(file));
 }
 
+static bool ovl_should_use_filemap(struct file *file)
+{
+	return ovl_should_use_filemap_meta(file, false);
+}
+
 static int ovl_flush_filemap(struct file *file, loff_t offset, loff_t len)
 {
-	if (!ovl_should_use_filemap(file))
+	if (!ovl_should_use_filemap_meta(file, true))
 		return 0;
 
 	return filemap_write_and_wait_range(file_inode(file)->i_mapping,
@@ -152,16 +194,23 @@ static int ovl_flush_filemap(struct file *file, loff_t offset, loff_t len)
 static int ovl_open(struct inode *inode, struct file *file)
 {
 	struct file *realfile;
+	int metacopy = 0;
 	int err;
 
-	err = ovl_maybe_copy_up(file_dentry(file), file->f_flags);
+	/* (O_RDWR | O_WRONLY) signals that we want meta copy up */
+	if (ovl_filemap_support(file) && (file->f_flags & O_ACCMODE) == O_RDWR)
+		metacopy = (O_RDWR | O_WRONLY);
+
+	/* Allow to copy up meta and defer copy up data to first data access */
+	err = ovl_maybe_copy_up(file_dentry(file), file->f_flags | metacopy);
 	if (err)
 		return err;
 
 	/* No longer need these flags, so don't pass them on to underlying fs */
 	file->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 
-	realfile = ovl_open_realfile(file, ovl_inode_realdata(inode));
+	realfile = ovl_open_realfile(file, metacopy ? ovl_inode_real(inode) :
+				     ovl_inode_realdata(inode));
 	if (IS_ERR(realfile))
 		return PTR_ERR(realfile);
 
@@ -337,7 +386,7 @@ static int ovl_real_fsync(struct file *file, loff_t start, loff_t end,
 
 static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
-	if (ovl_should_use_filemap(file))
+	if (ovl_should_use_filemap_meta(file, true))
 		return __generic_file_fsync(file, start, end, datasync);
 
 	return ovl_real_fsync(file, start, end, datasync);
@@ -381,7 +430,7 @@ static int ovl_real_mmap(struct file *file, struct vm_area_struct *vma)
 
 static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	if (ovl_should_use_filemap(file))
+	if (ovl_should_use_filemap_meta(file, true))
 		generic_file_mmap(file, vma);
 
 	return ovl_real_mmap(file, vma);
@@ -441,7 +490,7 @@ extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
 
 static int ovl_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 {
-	if (ovl_should_use_filemap(file))
+	if (ovl_should_use_filemap_meta(file, true))
 		return generic_fadvise(file, offset, len, advice);
 
 	/* XXX: Should we allow messing with lower shared page cache? */
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index ab61c07e8583..6baa3460c173 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -195,14 +195,22 @@ static inline struct dentry *ovl_do_tmpfile(struct dentry *dentry, umode_t mode)
 	return ret;
 }
 
-static inline bool ovl_open_flags_need_copy_up(int flags)
+static inline bool ovl_open_flags_need_data_copy_up(int flags)
 {
 	if (!flags)
 		return false;
 
+	/* Either O_RDWR or O_WRONLY will trigger data copy up */
 	return ((OPEN_FMODE(flags) & FMODE_WRITE) || (flags & O_TRUNC));
 }
 
+static inline bool ovl_open_flags_need_copy_up(int flags)
+{
+	/* O_RDWR|O_WRONLY together will trigger meta copy up */
+	return (flags & O_ACCMODE) == (O_RDWR | O_WRONLY) ||
+		ovl_open_flags_need_data_copy_up(flags);
+}
+
 /* util.c */
 int ovl_want_write(struct dentry *dentry);
 void ovl_drop_write(struct dentry *dentry);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 7c01327b1852..df6d13185f40 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -367,7 +367,7 @@ void ovl_set_upperdata(struct inode *inode)
 /* Caller should hold ovl_inode->lock */
 bool ovl_dentry_needs_data_copy_up_locked(struct dentry *dentry, int flags)
 {
-	if (!ovl_open_flags_need_copy_up(flags))
+	if (!ovl_open_flags_need_data_copy_up(flags))
 		return false;
 
 	return !ovl_test_flag(OVL_UPPERDATA, d_inode(dentry));
@@ -375,7 +375,7 @@ bool ovl_dentry_needs_data_copy_up_locked(struct dentry *dentry, int flags)
 
 bool ovl_dentry_needs_data_copy_up(struct dentry *dentry, int flags)
 {
-	if (!ovl_open_flags_need_copy_up(flags))
+	if (!ovl_open_flags_need_data_copy_up(flags))
 		return false;
 
 	return !ovl_has_upperdata(d_inode(dentry));
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH v2 4/5] ovl: lazy copy up data on page fault
  2019-01-22 12:34 [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
                   ` (2 preceding siblings ...)
  2019-01-22 12:34 ` [RFC][PATCH v2 3/5] ovl: lazy copy up of data on first data access Amir Goldstein
@ 2019-01-22 12:34 ` Amir Goldstein
  2019-01-22 12:34 ` [RFC][PATCH v2 5/5] ovl: noop aops to test filemap operations and lazy copy up Amir Goldstein
  2019-01-24 17:18 ` [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
  5 siblings, 0 replies; 42+ messages in thread
From: Amir Goldstein @ 2019-01-22 12:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, linux-unionfs

When overlay filemap operations are supported, all maps, including
private and non writable maps that are created from file that was
opened after copy up, map overlay inode pages. If the file was opened
O_RDWR with lazy copy up of data, the first page fault, read or write,
will trigger data copy up.

NOTE: SHARED maps that are created from file opened O_RDONLY before
data copy up will stay mapped to lower real file also after copy up.
This long standing inconsistency will need to be resolved by future
patches.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/file.c | 59 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index e02b61af9488..35c75e9d8111 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -397,6 +397,49 @@ const struct address_space_operations ovl_aops = {
 	.direct_IO		= noop_direct_IO,
 };
 
+static vm_fault_t ovl_fault(struct vm_fault *vmf)
+{
+	struct file *file = vmf->vma->vm_file;
+	struct inode *inode = file_inode(file);
+	bool blocking = (vmf->flags & FAULT_FLAG_KILLABLE) ||
+			((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
+			 !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT));
+	int err = 0;
+
+	/*
+	 * Handle fault of pages in maps that were created from file that was
+	 * opened O_RDWR and before data copy up.
+	 */
+	if (!ovl_has_upperdata(inode)) {
+		/* TODO: async copy up data? */
+		if (!blocking)
+			goto out_err;
+
+		up_read(&vmf->vma->vm_mm->mmap_sem);
+		err = ovl_maybe_copy_up(file_dentry(file), O_WRONLY);
+		if (err)
+			goto out_err;
+
+		return VM_FAULT_RETRY;
+	}
+
+	return filemap_fault(vmf);
+
+out_err:
+	pr_warn_ratelimited("overlayfs: %s copy up data on page fault (%pd2, err=%i)\n",
+			    blocking ? "failed to" : "no wait for",
+			    file_dentry(file), err);
+
+	/* We must return VM_FAULT_RETRY if we released mmap_sem */
+	return blocking ? VM_FAULT_NOPAGE : VM_FAULT_RETRY;
+}
+
+static const struct vm_operations_struct ovl_file_vm_ops = {
+	.fault		= ovl_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite   = filemap_page_mkwrite,
+};
+
 static int ovl_real_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct file *realfile = file->private_data;
@@ -430,8 +473,20 @@ static int ovl_real_mmap(struct file *file, struct vm_area_struct *vma)
 
 static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	if (ovl_should_use_filemap_meta(file, true))
-		generic_file_mmap(file, vma);
+	/*
+	 * All maps (also private and non writable) that are created from file
+	 * that was opened after copy up, map overlay inode pages. If the file
+	 * was opened O_RDWR with lazy copy up of data, the first page fault
+	 * will trigger data copy up.
+	 *
+	 * FIXME: SHARED maps that are created from file opened O_RDONLY before
+	 * data copy up will stay mapped to lower real file also after copy up.
+	 */
+	if (ovl_should_use_filemap_meta(file, true)) {
+		vma->vm_ops = &ovl_file_vm_ops;
+		file_accessed(file);
+		return 0;
+	}
 
 	return ovl_real_mmap(file, vma);
 }
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH v2 5/5] ovl: noop aops to test filemap operations and lazy copy up
  2019-01-22 12:34 [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
                   ` (3 preceding siblings ...)
  2019-01-22 12:34 ` [RFC][PATCH v2 4/5] ovl: lazy copy up data on page fault Amir Goldstein
@ 2019-01-22 12:34 ` Amir Goldstein
  2019-01-24 17:18 ` [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
  5 siblings, 0 replies; 42+ messages in thread
From: Amir Goldstein @ 2019-01-22 12:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, linux-unionfs

WIP - DO NOT MERGE!!!

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/file.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 35c75e9d8111..80d60a33a2a3 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -392,9 +392,21 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	return ovl_real_fsync(file, start, end, datasync);
 }
 
+/* Noop aops to test filemap operations and lazy copy up */
+static int ovl_noop_writepage(struct page *page, struct writeback_control *wbc)
+{
+	unlock_page(page);
+	return 0;
+}
+
 const struct address_space_operations ovl_aops = {
+	.readpage	= simple_readpage,
+	.write_begin	= simple_write_begin,
+	.write_end	= simple_write_end,
+	.set_page_dirty	= __set_page_dirty_no_writeback,
+	.writepage	= ovl_noop_writepage,
 	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	.direct_IO		= noop_direct_IO,
+	.direct_IO	= noop_direct_IO,
 };
 
 static vm_fault_t ovl_fault(struct vm_fault *vmf)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-22 12:34 [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
                   ` (4 preceding siblings ...)
  2019-01-22 12:34 ` [RFC][PATCH v2 5/5] ovl: noop aops to test filemap operations and lazy copy up Amir Goldstein
@ 2019-01-24 17:18 ` Amir Goldstein
  2019-01-24 22:35   ` Amir Goldstein
  5 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-24 17:18 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

>
> Can you please point me in the direction of how to implement proper
> aops? Or better yet, send me a demo patch?
> I figure we need to call vfs_iter_read/vfs_iter_write of real file
> with IOCB_DIRECT for page io? But I am very uncertain about locking
> order requirements and whether there is already some infrastructure
> that we can use (some splice variant)?
>
> Any other thoughts?
>

Miklos,

I've made a small progress.

Pushed two more commits to ovl-lazy-copyup-wip:
ea9a0083d1ed - ovl: readahead upper inode page cache on overlay page fault
d1ff6e6366d5 - ovl: implement overlay stacked readpage

That implements readpage by reading and copying from upper page cache.

With that change, the basic unionmount test on tmpfs passes.
With unionmount over xfs, tests fail because partial page write
is not working correctly, although mmap partial write works well.

I will try to follow similar pattern to implement writepage by copy to
upper page cache.

Let me know if you think I am going in the wrong direction.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-24 17:18 ` [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
@ 2019-01-24 22:35   ` Amir Goldstein
  2019-01-25  9:54     ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-24 22:35 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

On Thu, Jan 24, 2019 at 7:18 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> >
> > Can you please point me in the direction of how to implement proper
> > aops? Or better yet, send me a demo patch?
> > I figure we need to call vfs_iter_read/vfs_iter_write of real file
> > with IOCB_DIRECT for page io? But I am very uncertain about locking
> > order requirements and whether there is already some infrastructure
> > that we can use (some splice variant)?
> >
> > Any other thoughts?
> >
>
> Miklos,
>
> I've made a small progress.
>
> Pushed two more commits to ovl-lazy-copyup-wip:
> ea9a0083d1ed - ovl: readahead upper inode page cache on overlay page fault
> d1ff6e6366d5 - ovl: implement overlay stacked readpage

Pushed two more:
838c119934a2 - ovl: implement write_begin/write_end aops
44b9e89154ba - ovl: return overlay inode i_size for stat()

>
> That implements readpage by reading and copying from upper page cache.
>
> With that change, the basic unionmount test on tmpfs passes.

That is not surprising. overlay filemap ops are disabled for upper
tmpfs in current
implementation, because tmpfs has no readpage().

> With unionmount over xfs, tests fail because partial page write
> is not working correctly, although mmap partial write works well.
>

But that is fixed now, so ./run --ov --samefs also passes over xfs.
Which says something bad about our test coverage...

> I will try to follow similar pattern to implement writepage by copy to
> upper page cache.
>

...Luckily, I have implemented ./run --ov --recycle way back, otherwise we
would have had very poor test coverage for persistency of overlayfs writes.

So we currently have 6 xfstests failing on write persistency:
overlay/018 overlay/032 overlay/044 overlay/047 overlay/050 overlay/051

and 2 unionmount tests (over /base xfs):
./run --ov --samefs --recycle rename-new-pop-dir
./run --ov --samefs --recycle rename-mass-dir

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-24 22:35   ` Amir Goldstein
@ 2019-01-25  9:54     ` Miklos Szeredi
  2019-01-25 11:24       ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-25  9:54 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

[-- Attachment #1: Type: text/plain, Size: 2437 bytes --]

On Thu, Jan 24, 2019 at 11:35 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Jan 24, 2019 at 7:18 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > >
> > > Can you please point me in the direction of how to implement proper
> > > aops? Or better yet, send me a demo patch?
> > > I figure we need to call vfs_iter_read/vfs_iter_write of real file
> > > with IOCB_DIRECT for page io? But I am very uncertain about locking
> > > order requirements and whether there is already some infrastructure
> > > that we can use (some splice variant)?
> > >
> > > Any other thoughts?
> > >
> >
> > Miklos,
> >
> > I've made a small progress.
> >
> > Pushed two more commits to ovl-lazy-copyup-wip:
> > ea9a0083d1ed - ovl: readahead upper inode page cache on overlay page fault
> > d1ff6e6366d5 - ovl: implement overlay stacked readpage
>
> Pushed two more:
> 838c119934a2 - ovl: implement write_begin/write_end aops
> 44b9e89154ba - ovl: return overlay inode i_size for stat()
>
> >
> > That implements readpage by reading and copying from upper page cache.
> >
> > With that change, the basic unionmount test on tmpfs passes.
>
> That is not surprising. overlay filemap ops are disabled for upper
> tmpfs in current
> implementation, because tmpfs has no readpage().
>
> > With unionmount over xfs, tests fail because partial page write
> > is not working correctly, although mmap partial write works well.
> >
>
> But that is fixed now, so ./run --ov --samefs also passes over xfs.
> Which says something bad about our test coverage...
>
> > I will try to follow similar pattern to implement writepage by copy to
> > upper page cache.
> >
>
> ...Luckily, I have implemented ./run --ov --recycle way back, otherwise we
> would have had very poor test coverage for persistency of overlayfs writes.
>
> So we currently have 6 xfstests failing on write persistency:
> overlay/018 overlay/032 overlay/044 overlay/047 overlay/050 overlay/051
>
> and 2 unionmount tests (over /base xfs):
> ./run --ov --samefs --recycle rename-new-pop-dir
> ./run --ov --samefs --recycle rename-mass-dir

Great progress.  Thanks!

I don't think you need to mess with dcache flushing, since this is all
kernel (page cache) pages without any possible dcache aliasing issues
that plague user pages.  So a simple copy_page() should do.

It would be even better to do it with O_DIRECT to remove cache
duplication.  Something like the attached...

Thanks,
Miklos

[-- Attachment #2: ovl-readpage-with-DIO.patch --]
[-- Type: text/x-patch, Size: 4917 bytes --]

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 8e4056b262db..6f1e50202779 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -35,9 +35,19 @@ static struct file *ovl_open_realfile(const struct file *file,
 	struct inode *inode = file_inode(file);
 	struct file *realfile;
 	const struct cred *old_cred;
+	int flags = file->f_flags | O_NOATIME;
+
+	if (realinode == ovl_inode_upper(inode)) {
+		if (flags & O_WRONLY)
+			flags = (flags & ~O_ACCMODE) | O_RDWR;
+
+		if (realinode->i_mapping->a_ops &&
+		    realinode->i_mapping->a_ops->direct_IO)
+			flags |= O_DIRECT;
+	}
 
 	old_cred = ovl_override_creds(inode->i_sb);
-	realfile = open_with_fake_path(&file->f_path, file->f_flags | O_NOATIME,
+	realfile = open_with_fake_path(&file->f_path, flags,
 				       realinode, current_cred());
 	revert_creds(old_cred);
 
@@ -122,6 +132,9 @@ static int ovl_real_fdget_meta(const struct file *file, struct fd *real,
 		return PTR_ERR_OR_ZERO(real->file);
 	}
 
+	if (realinode == ovl_inode_upper(inode))
+		return 0;
+
 	/* Did the flags change since open? */
 	if (unlikely((file->f_flags ^ real->file->f_flags) & ~O_NOATIME))
 		return ovl_change_flags(real->file, file->f_flags);
@@ -145,7 +158,6 @@ static bool ovl_filemap_support(struct file *file)
 static bool ovl_should_use_filemap_meta(struct file *file, bool meta_only)
 {
 	struct inode *inode = file_inode(file);
-	struct inode *upper;
 	int err;
 
 	if (!ovl_filemap_support(file))
@@ -168,12 +180,6 @@ static bool ovl_should_use_filemap_meta(struct file *file, bool meta_only)
 		}
 	}
 
-	/* Does upper inode support page cache operations? */
-	upper = ovl_inode_upper(inode);
-	if (!upper || unlikely(!upper->i_mapping ||
-			       !upper->i_mapping->a_ops->readpage))
-		return false;
-
 	/*
 	 * If file was opened O_RDWR and @meta_only is true, we use overlay
 	 * inode filemap operations, but defer data copy up further.
@@ -186,7 +192,7 @@ static bool ovl_should_use_filemap_meta(struct file *file, bool meta_only)
 	 * including pure upper inodes, so ovl_sync_fs() can sync all dirty
 	 * overlay inodes without having to sync all upper fs dirty inodes.
 	 */
-	return upper && ovl_has_upperdata(inode);
+	return ovl_inode_upper(inode) && ovl_has_upperdata(inode);
 }
 
 static bool ovl_should_use_filemap(struct file *file)
@@ -404,52 +410,43 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	return ovl_real_fsync(file, start, end, datasync);
 }
 
-static struct page *ovl_real_get_page(struct file *file, pgoff_t index)
-{
-	struct file *realfile = file->private_data;
-	struct page *page;
 
-	page = read_mapping_page(file_inode(realfile)->i_mapping, index, NULL);
-	if (IS_ERR(page))
-		return page;
-	if (!PageUptodate(page)) {
-		put_page(page);
-		return ERR_PTR(-EIO);
-	}
-	lock_page(page);
-
-	return page;
-}
-
-static void ovl_copy_page(struct page *dst_page, struct page *src_page)
+static int ovl_real_readpage(struct file *realfile, struct page *page)
 {
-	void *src_addr = kmap_atomic(src_page);
-	void *dst_addr = kmap_atomic(dst_page);
+	struct bio_vec bvec = {
+		.bv_page = page,
+		.bv_len = PAGE_SIZE,
+		.bv_offset = 0,
+	};
+	loff_t pos = page->index << PAGE_SHIFT;
+	struct iov_iter iter;
+	ssize_t ret;
 
-	flush_dcache_page(src_page);
-	memcpy(dst_addr, src_addr, PAGE_SIZE);
+	iov_iter_bvec(&iter, READ, &bvec, 1, PAGE_SIZE);
 
-	kunmap_atomic(dst_addr);
-	kunmap_atomic(src_addr);
+	ret = vfs_iter_read(realfile, &iter, &pos, 0);
+
+	return ret < 0 ? ret : 0;
 }
 
 static int ovl_do_readpage(struct file *file, struct page *page)
 {
-	struct page *realpage;
-	int ret = 0;
+	int ret;
+	struct fd real;
+	const struct cred *old_cred;
 
-	realpage = ovl_real_get_page(file, page->index);
-	if (!IS_ERR(realpage)) {
-		ovl_copy_page(page, realpage);
-		unlock_page(realpage);
-		put_page(realpage);
-	} else {
-		ret = PTR_ERR(realpage);
-		clear_highpage(page);
-	}
+	ret = ovl_real_fdget(file, &real);
+	if (ret)
+		return ret;
+
+	old_cred = ovl_override_creds(file_inode(file)->i_sb);
+	ret = ovl_real_readpage(real.file, page);
+	revert_creds(old_cred);
+
+	fdput(real);
 
-	flush_dcache_page(page);
-	SetPageUptodate(page);
+	if (!ret)
+		SetPageUptodate(page);
 
 	return ret;
 }
@@ -470,6 +467,7 @@ static int ovl_write_begin(struct file *file, struct address_space *mapping,
 {
 	struct page *page;
 	pgoff_t index;
+	int err;
 
 	index = pos >> PAGE_SHIFT;
 
@@ -477,11 +475,20 @@ static int ovl_write_begin(struct file *file, struct address_space *mapping,
 	if (!page)
 		return -ENOMEM;
 
-	*pagep = page;
 
-	if (!PageUptodate(page) && (len != PAGE_SIZE))
-		return ovl_do_readpage(file, page);
+	if (!PageUptodate(page) && len != PAGE_SIZE) {
+		err = ovl_do_readpage(file, page);
+		if (err) {
+			pr_warn("ovl_do_readpage: %i", err);
+
+			unlock_page(page);
+			put_page(page);
 
+			return -EIO;
+		}
+	}
+
+	*pagep = page;
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-25  9:54     ` Miklos Szeredi
@ 2019-01-25 11:24       ` Amir Goldstein
  2019-01-25 12:21         ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-25 11:24 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

[-- Attachment #1: Type: text/plain, Size: 2102 bytes --]

> > ...Luckily, I have implemented ./run --ov --recycle way back, otherwise we
> > would have had very poor test coverage for persistency of overlayfs writes.
> >
> > So we currently have 6 xfstests failing on write persistency:
> > overlay/018 overlay/032 overlay/044 overlay/047 overlay/050 overlay/051
> >
> > and 2 unionmount tests (over /base xfs):
> > ./run --ov --samefs --recycle rename-new-pop-dir
> > ./run --ov --samefs --recycle rename-mass-dir
>
> Great progress.  Thanks!
>
> I don't think you need to mess with dcache flushing, since this is all
> kernel (page cache) pages without any possible dcache aliasing issues
> that plague user pages.  So a simple copy_page() should do.
>

Pardon my ignorance, but couldn't pages be mapped to user?
Anyway, it seems that generic_file_buffered_read() does
flush_dcache_page() if needed, so we need not worry about it.

> It would be even better to do it with O_DIRECT to remove cache
> duplication.  Something like the attached...
>

Yes, definitely better. Thanks for the patch!

You'll be happy to know that the two unionmount tests fails on tmpfs
as expected (or maybe you already checked that).

I do get one new lockdep splat from xfstest, something we need to
look into (attached 013.dmesg).

One questions about your patch:
Skipping the ovl_change_flags() for upper realfile, that's just a
quick hack, right?
I'll need to fix this up properly.

About the way forward, implementing writepage() should be quite straight forward
from this, so I will add write support re-work the series and re-post.

I think I can now see how to implement the LOWER -> LOWER_SHARED ->
UPPER_SHARED transitions, but that will require some more juggling of
realfile flags etc.

With the current implementation of UPPER_SHARED only we can gain:
- No copy up latency on open() with metacopy=on
- Very improved ovl_sync_fs() (no longer sync all containers on syncfs(2))

Do you think we should aim for merging just UPPER_SHARED in dev
cycle to hash out the bugs of overlay page cache or aim for merging
LOWER_SHARED+UPPER_SHARED when it is ready?

Thanks,
Amir.

[-- Attachment #2: 013.dmesg --]
[-- Type: application/octet-stream, Size: 9576 bytes --]

[  119.448268] run fstests overlay/013 at 2019-01-25 10:55:14
[  119.755174] XFS (vde): Mounting V5 Filesystem
[  119.774400] XFS (vde): Ending clean mount
[  119.925944] XFS (vdf): Unmounting Filesystem
[  119.981734] XFS (vdf): Mounting V5 Filesystem
[  120.008546] XFS (vdf): Ending clean mount

[  120.133970] ======================================================
[  120.135229] WARNING: possible circular locking dependency detected
[  120.136520] 5.0.0-rc1-xfstests-00011-g8c152042ba8e-dirty #4047 Not tainted
[  120.137917] ------------------------------------------------------
[  120.139182] test_upper/9289 is trying to acquire lock:
[  120.140250] 00000000c03c6afd (&inode->i_rwsem){++++}, at: xfs_ilock+0x18b/0x208
[  120.141745] 
               but task is already holding lock:
[  120.142929] 000000002e44e1ff (&mm->mmap_sem){++++}, at: __do_page_fault+0x1b4/0x3ff
[  120.144388] 
               which lock already depends on the new lock.

[  120.146072] 
               the existing dependency chain (in reverse order) is:
[  120.147652] 
               -> #1 (&mm->mmap_sem){++++}:
[  120.148847]        __might_fault+0x61/0x89
[  120.149727]        copy_page_to_iter+0x11f/0x2c1
[  120.150474]        generic_file_buffered_read+0x279/0x660
[  120.151544]        xfs_file_buffered_aio_read+0x117/0x135
[  120.152557]        xfs_file_read_iter+0x81/0xc8
[  120.153538]        __vfs_read+0x136/0x16e
[  120.154359]        vfs_read+0xb5/0x141
[  120.155137]        ksys_read+0x60/0xb1
[  120.155920]        do_syscall_64+0x57/0x14e
[  120.156783]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  120.157920] 
               -> #0 (&inode->i_rwsem){++++}:
[  120.159055]        lock_acquire+0x141/0x16c
[  120.159921]        down_read_nested+0x45/0x9c
[  120.160818]        xfs_ilock+0x18b/0x208
[  120.161522]        xfs_file_dio_aio_read+0x110/0x145
[  120.162489]        xfs_file_read_iter+0x7a/0xc8
[  120.163380]        do_iter_readv_writev+0x14b/0x182
[  120.164203]        do_iter_read+0x87/0xfb
[  120.164763]        ovl_do_readpage+0x9e/0xed
[  120.165341]        ovl_readpage+0xf/0x22
[  120.165915]        filemap_fault+0x311/0x425
[  120.166620]        __do_fault+0x1d/0x67
[  120.167343]        __handle_mm_fault+0x823/0xae5
[  120.168300]        __do_page_fault+0x2fb/0x3ff
[  120.169366]        async_page_fault+0x1e/0x30
[  120.170361]        __clear_user+0x36/0x5f
[  120.171305]        padzero+0x1c/0x2b
[  120.172034]        load_elf_binary+0x795/0xdfe
[  120.172870]        search_binary_handler+0x72/0xdb
[  120.173812]        __do_execve_file.isra.11+0x708/0x9b2
[  120.174837]        do_execve+0x21/0x24
[  120.175588]        __x64_sys_execve+0x26/0x2b
[  120.176451]        do_syscall_64+0x57/0x14e
[  120.177274]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  120.178364] 
               other info that might help us debug this:

[  120.179929]  Possible unsafe locking scenario:

[  120.181086]        CPU0                    CPU1
[  120.181974]        ----                    ----
[  120.182866]   lock(&mm->mmap_sem);
[  120.183549]                                lock(&inode->i_rwsem);
[  120.184735]                                lock(&mm->mmap_sem);
[  120.185746]   lock(&inode->i_rwsem);
[  120.186235] 
                *** DEADLOCK ***

[  120.187019] 1 lock held by test_upper/9289:
[  120.187728]  #0: 000000002e44e1ff (&mm->mmap_sem){++++}, at: __do_page_fault+0x1b4/0x3ff
[  120.189372] 
               stack backtrace:
[  120.190268] CPU: 0 PID: 9289 Comm: test_upper Not tainted 5.0.0-rc1-xfstests-00011-g8c152042ba8e-dirty #4047
[  120.192277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  120.194139] Call Trace:
[  120.194669]  dump_stack+0x67/0x8e
[  120.195539]  print_circular_bug.isra.18+0x26e/0x27b
[  120.196585]  check_prev_add.constprop.26+0x12c/0x5f7
[  120.197499]  ? __lock_acquire+0xd80/0xee7
[  120.198259]  __lock_acquire+0xd80/0xee7
[  120.199062]  lock_acquire+0x141/0x16c
[  120.199723]  ? xfs_ilock+0x18b/0x208
[  120.200352]  ? xfs_file_dio_aio_read+0x110/0x145
[  120.201158]  down_read_nested+0x45/0x9c
[  120.201829]  ? xfs_ilock+0x18b/0x208
[  120.202456]  xfs_ilock+0x18b/0x208
[  120.203057]  xfs_file_dio_aio_read+0x110/0x145
[  120.203933]  xfs_file_read_iter+0x7a/0xc8
[  120.204729]  do_iter_readv_writev+0x14b/0x182
[  120.205514]  do_iter_read+0x87/0xfb
[  120.206018]  ovl_do_readpage+0x9e/0xed
[  120.206676]  ovl_readpage+0xf/0x22
[  120.207177]  ? ovl_write_begin+0xab/0xab
[  120.208028]  filemap_fault+0x311/0x425
[  120.208802]  ? ovl_fault+0xc9/0x16d
[  120.209494]  __do_fault+0x1d/0x67
[  120.210152]  __handle_mm_fault+0x823/0xae5
[  120.210962]  __do_page_fault+0x2fb/0x3ff
[  120.211747]  async_page_fault+0x1e/0x30
[  120.212507] RIP: 0010:__clear_user+0x36/0x5f
[  120.213202] Code: 71 20 37 82 53 48 89 f3 be 13 00 00 00 e8 a2 58 77 ff 0f 01 cb 48 89 d8 48 c1 eb 03 48 89 ef 83 e0 07 48 89 d9 48 85 c9 74 0f <48> c7 07 00 00 00 00 48 83 c7 08 ff c9 75 f1 48 89 c1 85 c9 74 0a
[  120.217055] RSP: 0018:ffffc90002f37d88 EFLAGS: 00050206
[  120.218119] RAX: 0000000000000000 RBX: 00000000000001f5 RCX: 00000000000001f5
[  120.219217] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000563302b8b058
[  120.220578] RBP: 0000563302b8b058 R08: 00000000df1fc5d3 R09: 0000000000000006
[  120.221953] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888079794900
[  120.223331] R13: 0000000000000000 R14: 0000563302b8b070 R15: 0000563302b8b058
[  120.224675]  ? __clear_user+0x1e/0x5f
[  120.225192]  padzero+0x1c/0x2b
[  120.225614]  load_elf_binary+0x795/0xdfe
[  120.226172]  search_binary_handler+0x72/0xdb
[  120.226828]  __do_execve_file.isra.11+0x708/0x9b2
[  120.227685]  do_execve+0x21/0x24
[  120.228452]  __x64_sys_execve+0x26/0x2b
[  120.229447]  do_syscall_64+0x57/0x14e
[  120.230161]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  120.231137] RIP: 0033:0x7fc14348a647
[  120.231846] Code: Bad RIP value.
[  120.232476] RSP: 002b:00007ffc28f4b638 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[  120.233529] RAX: ffffffffffffffda RBX: 00000000013cf568 RCX: 00007fc14348a647
[  120.234450] RDX: 000000000162f008 RSI: 00000000013d4f68 RDI: 00000000013cf568
[  120.235382] RBP: 0000000000000000 R08: 0000000000710710 R09: 0000000000000000
[  120.236840] R10: 000000000000059a R11: 0000000000000202 R12: 0000000000000000
[  120.238271] R13: 00000000013d4f68 R14: 000000000162f008 R15: 0000000000000000
[  120.246570] ------------[ cut here ]------------
[  120.247604] downgrading a read lock
[  120.247618] WARNING: CPU: 1 PID: 9289 at /home/amir/build/src/linux/kernel/locking/lockdep.c:3553 lock_downgrade+0xe3/0x17b
[  120.250592] CPU: 1 PID: 9289 Comm: test_upper Not tainted 5.0.0-rc1-xfstests-00011-g8c152042ba8e-dirty #4047
[  120.252718] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  120.254537] RIP: 0010:lock_downgrade+0xe3/0x17b
[  120.255537] Code: 00 00 00 8b 44 24 04 89 83 60 08 00 00 48 8b 45 00 48 89 83 58 08 00 00 f6 45 32 03 74 0e 48 c7 c7 79 7a 28 82 e8 77 d8 fa ff <0f> 0b 8a 45 32 4c 89 6d 08 44 89 e6 48 89 df 83 e0 fc 83 c8 01 88
[  120.259903] RSP: 0018:ffffc90002f37e28 EFLAGS: 00010082
[  120.261047] RAX: 0000000000000000 RBX: ffff88807a3e02c0 RCX: 0000000000000000
[  120.262509] RDX: ffffffff810fb24f RSI: 0000000000000001 RDI: ffffffff810fb261
[  120.263887] RBP: ffff88807a3e0b28 R08: 0000000000000017 R09: 0000000000000004
[  120.265338] R10: 0000000000000000 R11: ffffc90002f37c47 R12: 0000000000000001
[  120.266666] R13: ffffffff811e4bb3 R14: 0000000000000246 R15: ffff88805ca38168
[  120.268122] FS:  00007f371215ebc0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[  120.269597] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  120.270361] CR2: 00007f37115fd1b0 CR3: 000000007c470006 CR4: 00000000003606e0
[  120.271299] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  120.272776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  120.274323] Call Trace:
[  120.274863]  downgrade_write+0x17/0x81
[  120.275698]  __do_munmap+0x26f/0x2fe
[  120.276437]  __vm_munmap+0x74/0xbf
[  120.277138]  __x64_sys_munmap+0x27/0x2c
[  120.277931]  do_syscall_64+0x57/0x14e
[  120.278697]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  120.279738] RIP: 0033:0x7f3711f5c447
[  120.280475] Code: b4 eb 91 f7 d8 89 05 f8 bc 20 00 48 c7 c0 ff ff ff ff eb a1 f7 d8 89 05 e7 bc 20 00 e9 73 ff ff ff 66 90 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8d 0d c9 bc 20 00 f7 d8 89 01 48 83
[  120.284277] RSP: 002b:00007ffe35bb37c8 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
[  120.285603] RAX: ffffffffffffffda RBX: 0000004060eef67d RCX: 00007f3711f5c447
[  120.286542] RDX: 0000004000000000 RSI: 0000000000002739 RDI: 00007f3712160000
[  120.287592] RBP: 00007ffe35bb39e0 R08: 0000000000000000 R09: 00007f3711583b98
[  120.288936] R10: 00007f3712167030 R11: 0000000000000202 R12: 0000000000000000
[  120.289887] R13: 00007f3712168170 R14: 00007f371215ebc0 R15: 00007f3712168170
[  120.290881] irq event stamp: 1547
[  120.291374] hardirqs last  enabled at (1547): [<ffffffff81a73a36>] _raw_spin_unlock_irq+0x29/0x33
[  120.293412] hardirqs last disabled at (1546): [<ffffffff81a737ba>] _raw_spin_lock_irq+0x10/0x67
[  120.295143] softirqs last  enabled at (1182): [<ffffffff81e0035e>] __do_softirq+0x35e/0x39a
[  120.296779] softirqs last disabled at (873): [<ffffffff810a7902>] irq_exit+0x62/0xb1
[  120.298291] ---[ end trace c1aa87c93f028edf ]---
[  120.320405] XFS (vde): Unmounting Filesystem

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-25 11:24       ` Amir Goldstein
@ 2019-01-25 12:21         ` Miklos Szeredi
  2019-01-25 13:04           ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-25 12:21 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Fri, Jan 25, 2019 at 12:24 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > ...Luckily, I have implemented ./run --ov --recycle way back, otherwise we
> > > would have had very poor test coverage for persistency of overlayfs writes.
> > >
> > > So we currently have 6 xfstests failing on write persistency:
> > > overlay/018 overlay/032 overlay/044 overlay/047 overlay/050 overlay/051
> > >
> > > and 2 unionmount tests (over /base xfs):
> > > ./run --ov --samefs --recycle rename-new-pop-dir
> > > ./run --ov --samefs --recycle rename-mass-dir
> >
> > Great progress.  Thanks!
> >
> > I don't think you need to mess with dcache flushing, since this is all
> > kernel (page cache) pages without any possible dcache aliasing issues
> > that plague user pages.  So a simple copy_page() should do.
> >
>
> Pardon my ignorance, but couldn't pages be mapped to user?

Well, yes, but we don't really care about coherency with upper shared
mappings (upper is *our* backing store)

> Anyway, it seems that generic_file_buffered_read() does
> flush_dcache_page() if needed, so we need not worry about it.
>
> > It would be even better to do it with O_DIRECT to remove cache
> > duplication.  Something like the attached...
> >
>
> Yes, definitely better. Thanks for the patch!
>
> You'll be happy to know that the two unionmount tests fails on tmpfs
> as expected (or maybe you already checked that).
>
> I do get one new lockdep splat from xfstest, something we need to
> look into (attached 013.dmesg).

Yeah, I suspected we were going to hit mmap sem lock ordering issue in
readpage.  The easy way out is to submit reads through a workqueue.
Doing the actual read in an async manner is the right way anyway,
otherwise we'd be holding up mmap sem way too long.

> One questions about your patch:
> Skipping the ovl_change_flags() for upper realfile, that's just a
> quick hack, right?
> I'll need to fix this up properly.

Yes.

Speaking of hacks: I don't really like the O_RDWR | O_WRONLY thing.  I
understand why it works, but I think handling it explicitly would be
much cleaner.

> About the way forward, implementing writepage() should be quite straight forward
> from this, so I will add write support re-work the series and re-post.

The one issue is, we don't have the open file available from
writepage(s) like we do in readpage(s).   In fact leaving the
writeback till after close is an important optimization (e.g.
compilers write temporary files, that are then immediately deleted, so
no need to actually do any writeback).

That opens some interesting questions about how to do the writeback...

>
> I think I can now see how to implement the LOWER -> LOWER_SHARED ->
> UPPER_SHARED transitions, but that will require some more juggling of
> realfile flags etc.
>
> With the current implementation of UPPER_SHARED only we can gain:
> - No copy up latency on open() with metacopy=on
> - Very improved ovl_sync_fs() (no longer sync all containers on syncfs(2))
>
> Do you think we should aim for merging just UPPER_SHARED in dev
> cycle to hash out the bugs of overlay page cache or aim for merging
> LOWER_SHARED+UPPER_SHARED when it is ready?

Good plan.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-25 12:21         ` Miklos Szeredi
@ 2019-01-25 13:04           ` Amir Goldstein
  2019-01-25 13:31             ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-25 13:04 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

> > One questions about your patch:
> > Skipping the ovl_change_flags() for upper realfile, that's just a
> > quick hack, right?
> > I'll need to fix this up properly.
>
> Yes.
>
> Speaking of hacks: I don't really like the O_RDWR | O_WRONLY thing.  I
> understand why it works, but I think handling it explicitly would be
> much cleaner.
>

I figured... in fact, it is not at all interesting to propagate f_flags
to copy up helpers, so I was thinking of translating them to simple
copy_up_flags *before* the call to ovl_maybe_copy_up(), i.e.:

OVL_COPY_UP = 1,
OVL_COPY_UP_DATA = 2,
OVL_COPY_UP_WITH_DATA = 3,

> > About the way forward, implementing writepage() should be quite straight forward
> > from this, so I will add write support re-work the series and re-post.
>
> The one issue is, we don't have the open file available from
> writepage(s) like we do in readpage(s).   In fact leaving the
> writeback till after close is an important optimization (e.g.
> compilers write temporary files, that are then immediately deleted, so
> no need to actually do any writeback).
>
> That opens some interesting questions about how to do the writeback...
>

Mmm.... so now using upper page cache for write does sound so bad.
If only we had shared cache pages ;-)
We can treat upper page cache as a vessel for writeback and invalidate
it upon completion, since it is *ours*.
I'll see what I can come up with.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-25 13:04           ` Amir Goldstein
@ 2019-01-25 13:31             ` Miklos Szeredi
  2019-01-25 15:56               ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-25 13:31 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Fri, Jan 25, 2019 at 2:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > One questions about your patch:
> > > Skipping the ovl_change_flags() for upper realfile, that's just a
> > > quick hack, right?
> > > I'll need to fix this up properly.
> >
> > Yes.
> >
> > Speaking of hacks: I don't really like the O_RDWR | O_WRONLY thing.  I
> > understand why it works, but I think handling it explicitly would be
> > much cleaner.
> >
>
> I figured... in fact, it is not at all interesting to propagate f_flags
> to copy up helpers, so I was thinking of translating them to simple
> copy_up_flags *before* the call to ovl_maybe_copy_up(), i.e.:
>
> OVL_COPY_UP = 1,
> OVL_COPY_UP_DATA = 2,
> OVL_COPY_UP_WITH_DATA = 3,

Much better.

>
> > > About the way forward, implementing writepage() should be quite straight forward
> > > from this, so I will add write support re-work the series and re-post.
> >
> > The one issue is, we don't have the open file available from
> > writepage(s) like we do in readpage(s).   In fact leaving the
> > writeback till after close is an important optimization (e.g.
> > compilers write temporary files, that are then immediately deleted, so
> > no need to actually do any writeback).
> >
> > That opens some interesting questions about how to do the writeback...
> >
>
> Mmm.... so now using upper page cache for write does sound so bad.
> If only we had shared cache pages ;-)
> We can treat upper page cache as a vessel for writeback and invalidate
> it upon completion, since it is *ours*.
> I'll see what I can come up with.

Just calling ->writepage() of upper fs won't work, filesystems do
preparatory work in ->prepare_write(), etc, and those need the struct
file...  I don't think we are better off, than calling
vfs_write_iter().

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-25 13:31             ` Miklos Szeredi
@ 2019-01-25 15:56               ` Miklos Szeredi
  2019-01-25 21:18                 ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-25 15:56 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Fri, Jan 25, 2019 at 2:31 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, Jan 25, 2019 at 2:04 PM Amir Goldstein <amir73il@gmail.com> wrote:

> > Mmm.... so now using upper page cache for write does sound so bad.
> > If only we had shared cache pages ;-)
> > We can treat upper page cache as a vessel for writeback and invalidate
> > it upon completion, since it is *ours*.
> > I'll see what I can come up with.
>
> Just calling ->writepage() of upper fs won't work, filesystems do
> preparatory work in ->prepare_write(), etc, and those need the struct
> file...  I don't think we are better off, than calling
> vfs_write_iter().

With the cache on the overlay, I think we need to anchor the real file
in the overlay inode instead of the overlay file.  That applies to
reads as well as writes.  Not sure about lifetimes, obviously it's not
good to keep the real file open if there's no I/O being done.  Some
heuristics will be needed, but first version could just keep the file
open until the inode is evicted.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-25 15:56               ` Miklos Szeredi
@ 2019-01-25 21:18                 ` Amir Goldstein
  2019-01-27 18:22                   ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-25 21:18 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

On Fri, Jan 25, 2019 at 5:57 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, Jan 25, 2019 at 2:31 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Fri, Jan 25, 2019 at 2:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > Mmm.... so now using upper page cache for write does sound so bad.
> > > If only we had shared cache pages ;-)
> > > We can treat upper page cache as a vessel for writeback and invalidate
> > > it upon completion, since it is *ours*.
> > > I'll see what I can come up with.
> >
> > Just calling ->writepage() of upper fs won't work, filesystems do
> > preparatory work in ->prepare_write(), etc, and those need the struct
> > file...  I don't think we are better off, than calling
> > vfs_write_iter().
>
> With the cache on the overlay, I think we need to anchor the real file
> in the overlay inode instead of the overlay file.  That applies to
> reads as well as writes.  Not sure about lifetimes, obviously it's not
> good to keep the real file open if there's no I/O being done.  Some
> heuristics will be needed, but first version could just keep the file
> open until the inode is evicted.
>

I may be way off, but could you be over complicating this?
We can call vfs_write_iter() from ovl_write_end() after data has been
copied to overlay pages and while we still have a reference to realfile.
Then we copy over the dirt from overlay page cache to upper page cache.
When writeback comes along to overlay inode (maybe after close) we
only need to writeback the already dirty pages of upper.
Once not dirty, we can invalidate upper pages to reduce page cache
usage.

And what about readpage? why do you say that this applies to reads
as well? Do we not have the file during readpage()? Besides, AFAIKT
readahead() necessitates that ->readpage() can be called on an inode
so we don't need more than the upper inode to for ovl_readpage().
The direct IO read is just an optimization, isn't it?

I'll stick with the simple copy_page() implementation (back and forth)
unless you come up with a better suggestion. We will probably want
something more efficient later on, but I rather start with something
simple and working.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-25 21:18                 ` Amir Goldstein
@ 2019-01-27 18:22                   ` Amir Goldstein
  2019-01-28 19:22                     ` Vivek Goyal
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-27 18:22 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

On Fri, Jan 25, 2019 at 11:18 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Fri, Jan 25, 2019 at 5:57 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Fri, Jan 25, 2019 at 2:31 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Fri, Jan 25, 2019 at 2:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > Mmm.... so now using upper page cache for write does sound so bad.
> > > > If only we had shared cache pages ;-)
> > > > We can treat upper page cache as a vessel for writeback and invalidate
> > > > it upon completion, since it is *ours*.
> > > > I'll see what I can come up with.
> > >
> > > Just calling ->writepage() of upper fs won't work, filesystems do
> > > preparatory work in ->prepare_write(), etc, and those need the struct
> > > file...  I don't think we are better off, than calling
> > > vfs_write_iter().
> >
> > With the cache on the overlay, I think we need to anchor the real file
> > in the overlay inode instead of the overlay file.  That applies to
> > reads as well as writes.  Not sure about lifetimes, obviously it's not
> > good to keep the real file open if there's no I/O being done.  Some
> > heuristics will be needed, but first version could just keep the file
> > open until the inode is evicted.
> >
>
> I may be way off, but could you be over complicating this?
> We can call vfs_write_iter() from ovl_write_end() after data has been
> copied to overlay pages and while we still have a reference to realfile.
> Then we copy over the dirt from overlay page cache to upper page cache.
> When writeback comes along to overlay inode (maybe after close) we
> only need to writeback the already dirty pages of upper.
> Once not dirty, we can invalidate upper pages to reduce page cache
> usage.
>
> And what about readpage? why do you say that this applies to reads
> as well? Do we not have the file during readpage()? Besides, AFAIKT
> readahead() necessitates that ->readpage() can be called on an inode
> so we don't need more than the upper inode to for ovl_readpage().
> The direct IO read is just an optimization, isn't it?
>
> I'll stick with the simple copy_page() implementation (back and forth)
> unless you come up with a better suggestion. We will probably want
> something more efficient later on, but I rather start with something
> simple and working.
>

Like this:
https://github.com/amir73il/linux/commits/ovl-aops-wip

For the moment, I used part of your patch just for tmpfs readpage(),
so I dropped the O_DIRECT part.
I used vfs_write_iter() to write data to upper page cache on ovl_write_end().
It works in the sense that unionmount --recycle tests pass and so do all
xfstests overlay/quick tests, besides ro/rw mmap read/write (overlay/061).

Alas, writepage() is not being called, so my implementation is missing
something. I am hoping that you can tell me what it is, or what is wrong
with the design, correctness-wise.
Naturally, we can do things better with writepages(), reducing double page
cache usage, etc.

In fact, I enhanced test overlay/061 to cover the fact that mmap write is
NOT being persisted with current patches.

No need to worry about lack of test coverage for overlay writeback though,
there are many generic/quick tests failing with these patches.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-27 18:22                   ` Amir Goldstein
@ 2019-01-28 19:22                     ` Vivek Goyal
  2019-01-28 20:57                       ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2019-01-28 19:22 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, cgxu519, overlayfs

On Sun, Jan 27, 2019 at 08:22:50PM +0200, Amir Goldstein wrote:
> On Fri, Jan 25, 2019 at 11:18 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Fri, Jan 25, 2019 at 5:57 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Fri, Jan 25, 2019 at 2:31 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Fri, Jan 25, 2019 at 2:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > > > Mmm.... so now using upper page cache for write does sound so bad.
> > > > > If only we had shared cache pages ;-)
> > > > > We can treat upper page cache as a vessel for writeback and invalidate
> > > > > it upon completion, since it is *ours*.
> > > > > I'll see what I can come up with.
> > > >
> > > > Just calling ->writepage() of upper fs won't work, filesystems do
> > > > preparatory work in ->prepare_write(), etc, and those need the struct
> > > > file...  I don't think we are better off, than calling
> > > > vfs_write_iter().
> > >
> > > With the cache on the overlay, I think we need to anchor the real file
> > > in the overlay inode instead of the overlay file.  That applies to
> > > reads as well as writes.  Not sure about lifetimes, obviously it's not
> > > good to keep the real file open if there's no I/O being done.  Some
> > > heuristics will be needed, but first version could just keep the file
> > > open until the inode is evicted.
> > >
> >
> > I may be way off, but could you be over complicating this?
> > We can call vfs_write_iter() from ovl_write_end() after data has been
> > copied to overlay pages and while we still have a reference to realfile.
> > Then we copy over the dirt from overlay page cache to upper page cache.
> > When writeback comes along to overlay inode (maybe after close) we
> > only need to writeback the already dirty pages of upper.
> > Once not dirty, we can invalidate upper pages to reduce page cache
> > usage.
> >
> > And what about readpage? why do you say that this applies to reads
> > as well? Do we not have the file during readpage()? Besides, AFAIKT
> > readahead() necessitates that ->readpage() can be called on an inode
> > so we don't need more than the upper inode to for ovl_readpage().
> > The direct IO read is just an optimization, isn't it?
> >
> > I'll stick with the simple copy_page() implementation (back and forth)
> > unless you come up with a better suggestion. We will probably want
> > something more efficient later on, but I rather start with something
> > simple and working.
> >
> 
> Like this:
> https://github.com/amir73il/linux/commits/ovl-aops-wip
> 

Am I understanding it right, that we will use page cache at overlay inode
as well as upper file system. So there is duplication and memory usage
for upper files will double? If yes, this will be reason to provide
separate option for lazy copy up (Or switch to using O_DIRECT).

Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-28 19:22                     ` Vivek Goyal
@ 2019-01-28 20:57                       ` Amir Goldstein
  2019-01-28 21:17                         ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-28 20:57 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Miklos Szeredi, cgxu519, overlayfs

> > > I'll stick with the simple copy_page() implementation (back and forth)
> > > unless you come up with a better suggestion. We will probably want
> > > something more efficient later on, but I rather start with something
> > > simple and working.
> > >
> >
> > Like this:
> > https://github.com/amir73il/linux/commits/ovl-aops-wip
> >
>
> Am I understanding it right, that we will use page cache at overlay inode
> as well as upper file system. So there is duplication and memory usage
> for upper files will double?

That won't be acceptable, but using upper page cache as temporary pages
for the duration of writeback makes overlayfs code much simpler IMO and
avoids vma -> vfs locking order issues.

My idea was to get rid of the upper pages once writeback is done
(or periodically), but that wasn't implemented yet, because I am still
waiting to hear if the idea is inherently bad.

Think of upper inode page cache as a storage device cache.
Filesystem writes to storage device cache (not to media) and cache
is flushed to media on explicit from filesystem or periodically.

Once overlay pages have been copied to upper page cache, they
are "clean" and can be reclaimed at any time, even before upper
inode writeback has completed.

> If yes, this will be reason to provide
> separate option for lazy copy up (Or switch to using O_DIRECT).
>

Although I started this series with lazy copy up, it is lazy copy up that
depends on overlay aops and not the other way around.
The plan (as Miklos depicted it) is to use overlay page cache for copied
up file (UPPER_SHARED) as well as for lower files that were open
for write or mmap shared (LOWER_SHARED).

Lazy copy up is just a very simple stepping stone on the way there,
because it demonstrates that we can defer the decision of switching
from LOWER state (no page cache) to LOWER_SHARED until mmap
time and the decision to switch to UPPER_SHARED state (copy up data)
until the first page fault (later it will be on writable page fault).

In the end, "lazy copy up" will mean no copy up at all on open and it will
not depend on metacopy. There is also no expected reason why users
would need to opt-out of overlay page cache that solves shared mmap
inconsistency and improved efficiency of syncfs.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-28 20:57                       ` Amir Goldstein
@ 2019-01-28 21:17                         ` Miklos Szeredi
  2019-01-28 21:22                           ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-28 21:17 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Mon, Jan 28, 2019 at 9:58 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > I'll stick with the simple copy_page() implementation (back and forth)
> > > > unless you come up with a better suggestion. We will probably want
> > > > something more efficient later on, but I rather start with something
> > > > simple and working.
> > > >
> > >
> > > Like this:
> > > https://github.com/amir73il/linux/commits/ovl-aops-wip
> > >
> >
> > Am I understanding it right, that we will use page cache at overlay inode
> > as well as upper file system. So there is duplication and memory usage
> > for upper files will double?
>
> That won't be acceptable, but using upper page cache as temporary pages
> for the duration of writeback makes overlayfs code much simpler IMO and
> avoids vma -> vfs locking order issues.
>
> My idea was to get rid of the upper pages once writeback is done
> (or periodically), but that wasn't implemented yet, because I am still
> waiting to hear if the idea is inherently bad.

The problem with that is unnecessary memory copies.   If done right,
direct I/O will be simpler, faster, and won't have locking issues,
beacuse it can be done asynchronously.

Trying to put some code together...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-28 21:17                         ` Miklos Szeredi
@ 2019-01-28 21:22                           ` Miklos Szeredi
  2019-01-28 22:14                             ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-28 21:22 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

[-- Attachment #1: Type: text/plain, Size: 363 bytes --]

On Mon, Jan 28, 2019 at 10:17 PM Miklos Szeredi <miklos@szeredi.hu> wrote:

> Trying to put some code together...

Very rudimentary patch attached.  It doens't do direct IO yet, but
demonstrates what I meant about achoring the upper file in the inode.

Also found the trick to actually make writeback work:
super_setup_bdi() call in fill_super...

Thanks,
Miklos

[-- Attachment #2: d.patch --]
[-- Type: text/x-patch, Size: 13827 bytes --]

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index c922682ff1b2..7a8a6cd55e1f 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -37,14 +37,6 @@ static struct file *ovl_open_realfile(const struct file *file,
 	const struct cred *old_cred;
 	int flags = file->f_flags | O_NOATIME;
 
-	if (realinode == ovl_inode_upper(inode)) {
-		/* tmpfs has no readpage a_op, so need to read realfile */
-		if ((flags & O_WRONLY) &&
-		    (!realinode->i_mapping ||
-		     !realinode->i_mapping->a_ops->readpage))
-			flags = (flags & ~O_ACCMODE) | O_RDWR;
-	}
-
 	old_cred = ovl_override_creds(inode->i_sb);
 	realfile = open_with_fake_path(&file->f_path, flags,
 				       realinode, current_cred());
@@ -57,6 +49,43 @@ static struct file *ovl_open_realfile(const struct file *file,
 	return realfile;
 }
 
+static struct file *ovl_open_upper(const struct file *file,
+				   struct inode *realinode)
+{
+	struct inode *inode = file_inode(file);
+	struct ovl_inode *oi = OVL_I(inode);
+	struct file *realfile;
+	const struct cred *old_cred;
+	struct path realpath;
+
+	realfile = READ_ONCE(oi->upper_file);
+	if (realfile)
+		return realfile;
+
+	ovl_path_upper(file_dentry(file), &realpath);
+
+	old_cred = ovl_override_creds(inode->i_sb);
+	realfile = open_with_fake_path(&realpath, O_RDWR | O_NOATIME,
+				       realinode, current_cred());
+	revert_creds(old_cred);
+
+	pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
+		 file, file, ovl_whatisit(inode, realinode), file->f_flags,
+		 realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
+
+	if (!IS_ERR(realfile)) {
+		struct file *old = cmpxchg(&oi->upper_file, NULL, realfile);
+
+		if (old) {
+			fput(realfile);
+			realfile = old;
+		}
+	}
+
+	return realfile;
+}
+
+#if 0
 #define OVL_SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT)
 
 static int ovl_change_flags(struct file *file, unsigned int flags)
@@ -94,13 +123,14 @@ static int ovl_change_flags(struct file *file, unsigned int flags)
 
 	return 0;
 }
+#endif
 
 static bool ovl_filemap_support(const struct file *file)
 {
-	struct ovl_fs *ofs = file_inode(file)->i_sb->s_fs_info;
+	//struct ovl_fs *ofs = file_inode(file)->i_sb->s_fs_info;
 
 	/* TODO: implement aops to upper inode data */
-	return ofs->upper_mnt && ovl_aops.writepage;
+	return true;
 }
 
 static int ovl_file_maybe_copy_up(const struct file *file, bool allow_meta)
@@ -119,15 +149,17 @@ static int ovl_file_maybe_copy_up(const struct file *file, bool allow_meta)
 	return ovl_maybe_copy_up(file_dentry(file), copy_up_flags);
 }
 
-static int ovl_real_fdget_meta(const struct file *file, struct fd *real,
-			       bool allow_meta)
+static int ovl_real_fdget(const struct file *file, struct fd *real)
 {
 	struct inode *inode = file_inode(file);
+	struct ovl_inode *oi = OVL_I(inode);
 	struct inode *realinode;
 	int err;
 
 	real->flags = 0;
-	real->file = file->private_data;
+	real->file = READ_ONCE(oi->upper_file);
+	if (!real->file)
+		real->file = file->private_data;
 
 	/*
 	 * Lazy copy up caches the meta copy upper file on open O_RDWR.
@@ -136,35 +168,28 @@ static int ovl_real_fdget_meta(const struct file *file, struct fd *real,
 	 * we may try to open a lower file O_RDWR or perform data operations
 	 * (e.g. fallocate) on the metacopy inode.
 	 */
-	err = ovl_file_maybe_copy_up(file, allow_meta);
+	err = ovl_file_maybe_copy_up(file, false);
 	if (err)
 		return err;
 
-	if (allow_meta)
-		realinode = ovl_inode_real(inode);
-	else
-		realinode = ovl_inode_realdata(inode);
+	realinode = ovl_inode_realdata(inode);
 
 	/* Has it been copied up since we'd opened it? */
 	if (unlikely(file_inode(real->file) != realinode)) {
-		real->flags = FDPUT_FPUT;
-		real->file = ovl_open_realfile(file, realinode);
+		real->file = ovl_open_upper(file, realinode);
 
 		return PTR_ERR_OR_ZERO(real->file);
 	}
 
+#if 0
 	/* Did the flags change since open? */
 	if (unlikely((file->f_flags ^ real->file->f_flags) & ~O_NOATIME))
 		return ovl_change_flags(real->file, file->f_flags);
+#endif
 
 	return 0;
 }
 
-static int ovl_real_fdget(const struct file *file, struct fd *real)
-{
-	return ovl_real_fdget_meta(file, real, false);
-}
-
 static bool ovl_should_use_filemap_meta(struct file *file, bool allow_meta)
 {
 	struct inode *inode = file_inode(file);
@@ -220,6 +245,7 @@ static int ovl_flush_filemap(struct file *file, loff_t offset, loff_t len)
 static int ovl_open(struct inode *inode, struct file *file)
 {
 	struct file *realfile;
+	struct inode *realinode;
 	int err;
 	bool allow_meta = (file->f_mode & FMODE_WRITE) &&
 			ovl_filemap_support(file);
@@ -228,22 +254,28 @@ static int ovl_open(struct inode *inode, struct file *file)
 	if (err)
 		return err;
 
-	/* No longer need these flags, so don't pass them on to underlying fs */
-	file->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
+	realinode = allow_meta ?
+		ovl_inode_real(inode) : ovl_inode_realdata(inode);
 
-	realfile = ovl_open_realfile(file, allow_meta ? ovl_inode_real(inode) :
-				     ovl_inode_realdata(inode));
-	if (IS_ERR(realfile))
-		return PTR_ERR(realfile);
+	if (realinode == ovl_inode_upper(inode)) {
+		realfile = ovl_open_upper(file, realinode);
+		if (IS_ERR(realfile))
+			return PTR_ERR(realfile);
+	} else {
+		realfile = ovl_open_realfile(file, realinode);
+		if (IS_ERR(realfile))
+			return PTR_ERR(realfile);
 
-	file->private_data = realfile;
+		file->private_data = realfile;
+	}
 
 	return 0;
 }
 
 static int ovl_release(struct inode *inode, struct file *file)
 {
-	fput(file->private_data);
+	if (file->private_data)
+		fput(file->private_data);
 
 	return 0;
 }
@@ -284,6 +316,8 @@ static rwf_t ovl_iocb_to_rwf(struct kiocb *iocb)
 	int ifl = iocb->ki_flags;
 	rwf_t flags = 0;
 
+	WARN_ON(ifl & IOCB_DIRECT);
+
 	if (ifl & IOCB_NOWAIT)
 		flags |= RWF_NOWAIT;
 	if (ifl & IOCB_HIPRI)
@@ -292,6 +326,8 @@ static rwf_t ovl_iocb_to_rwf(struct kiocb *iocb)
 		flags |= RWF_DSYNC;
 	if (ifl & IOCB_SYNC)
 		flags |= RWF_SYNC;
+	if (ifl & IOCB_APPEND)
+		flags |= RWF_APPEND;
 
 	return flags;
 }
@@ -386,6 +422,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 static int ovl_real_fsync(struct file *file, loff_t start, loff_t end,
 			  int datasync)
 {
+#if 0
 	struct fd real;
 	const struct cred *old_cred;
 	int ret;
@@ -404,6 +441,9 @@ static int ovl_real_fsync(struct file *file, loff_t start, loff_t end,
 	fdput(real);
 
 	return ret;
+#endif
+
+	return 0;
 }
 
 static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
@@ -770,39 +810,6 @@ const struct file_operations ovl_file_operations = {
 	.remap_file_range	= ovl_remap_file_range,
 };
 
-static struct page *ovl_real_get_page(struct file *realfile, pgoff_t index)
-{
-	struct page *page;
-
-	page = read_mapping_page(file_inode(realfile)->i_mapping, index, NULL);
-	if (IS_ERR(page))
-		return page;
-
-	if (!PageUptodate(page)) {
-		put_page(page);
-		return ERR_PTR(-EIO);
-	}
-
-	lock_page(page);
-
-	return page;
-}
-
-static int ovl_real_copy_page(struct file *realfile, struct page *page)
-{
-	struct page *realpage;
-
-	realpage = ovl_real_get_page(realfile, page->index);
-	if (IS_ERR(realpage))
-		return PTR_ERR(realpage);
-
-	copy_highpage(page, realpage);
-	unlock_page(realpage);
-	put_page(realpage);
-
-	return 0;
-}
-
 static int ovl_real_readpage(struct file *realfile, struct page *page)
 {
 	struct bio_vec bvec = {
@@ -823,16 +830,17 @@ static int ovl_real_readpage(struct file *realfile, struct page *page)
 
 static int ovl_do_readpage(struct file *file, struct page *page)
 {
-	struct file *realfile = file->private_data;
+	struct inode *inode = file_inode(file);
+	struct ovl_inode *oi = OVL_I(inode);
+	struct file *realfile = READ_ONCE(oi->upper_file);
 	const struct cred *old_cred;
 	int ret;
 
-	/* tmpfs has no readpage a_op, so need to read with f_op */
+	if (!realfile)
+		realfile = file->private_data;
+
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
-	if (!realfile->f_mapping || !realfile->f_mapping->a_ops->readpage)
-		ret = ovl_real_readpage(realfile, page);
-	else
-		ret = ovl_real_copy_page(realfile, page);
+	ret = ovl_real_readpage(realfile, page);
 	revert_creds(old_cred);
 
 	if (!ret)
@@ -881,76 +889,57 @@ static int ovl_write_begin(struct file *file, struct address_space *mapping,
 	return 0;
 }
 
-static int ovl_real_write_end(struct file *file, loff_t pos,
-			      unsigned int copied, struct page *page)
-{
-	struct file *realfile = file->private_data;
-	unsigned int offset = (pos & (PAGE_SIZE - 1));
-	struct bio_vec bvec = {
-		.bv_page = page,
-		.bv_len = copied,
-		.bv_offset = offset,
-	};
-	struct iov_iter iter;
-	const struct cred *old_cred;
-	ssize_t ret;
-
-	iov_iter_bvec(&iter, WRITE, &bvec, 1, copied);
-
-	old_cred = ovl_override_creds(file_inode(file)->i_sb);
-	ret = vfs_iter_write(realfile, &iter, &pos, 0);
-	revert_creds(old_cred);
-
-	return ret < 0 ? ret : 0;
-}
-
-extern int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
-			       struct page *page);
-
 static int ovl_write_end(struct file *file, struct address_space *mapping,
 			 loff_t pos, unsigned len, unsigned copied,
 			 struct page *page, void *fsdata)
 {
-	int err;
+	int res;
 
-	err = ovl_real_write_end(file, pos, copied, page);
-	if (err) {
-		pr_warn("ovl_write_end: %i", err);
-		unlock_page(page);
-		put_page(page);
+	res = simple_write_end(file, mapping, pos, len, copied, page, fsdata);
 
-		return -EIO;
-	}
+	pr_info("ovl_write_end: page->flags: %lx\n", page->flags);
 
-	return __generic_write_end(file_inode(file), pos, copied, page);
+	return res;
 }
 
 static int ovl_real_writepage(struct page *page, struct writeback_control *wbc)
 {
-	struct inode *realinode = ovl_inode_real(page->mapping->host);
-	struct page *realpage;
-	int ret;
+	struct ovl_inode *oi = OVL_I(page->mapping->host);
+	struct file *realfile = oi->upper_file;
+	loff_t pos = page->index << PAGE_SHIFT;
+	loff_t len = i_size_read(page->mapping->host) - pos;
+	struct bio_vec bvec = {
+		.bv_page = page,
+		.bv_len = PAGE_SIZE,
+		.bv_offset = 0,
+	};
+	struct iov_iter iter;
+	ssize_t ret;
 
-	if (!realinode->i_mapping || !realinode->i_mapping->a_ops->writepage)
-		return -EIO;
+	WARN_ON(len <= 0);
+	if (len < PAGE_SIZE)
+		bvec.bv_len = len;
 
-	realpage = grab_cache_page(realinode->i_mapping, page->index);
-	copy_highpage(realpage, page);
-	set_page_dirty(realpage);
+	pr_info("ovl_real_writepage(%lli)\n", pos);
+	iov_iter_bvec(&iter, WRITE, &bvec, 1, bvec.bv_len);
 
-	/* Start writeback on and unlock real page */
-	ret = realinode->i_mapping->a_ops->writepage(realpage, wbc);
-	put_page(realpage);
+	ret = vfs_iter_write(realfile, &iter, &pos, 0);
+	pr_info("ovl_real_writepage(%lli) = %li\n", pos, ret);
 
-	return ret;
+	return ret < 0 ? ret : 0;
 }
 
 static int ovl_writepage(struct page *page, struct writeback_control *wbc)
 {
 	int ret;
+	const struct cred *old_cred;
 
 	set_page_writeback(page);
+
+	old_cred = ovl_override_creds(page->mapping->host->i_sb);
 	ret = ovl_real_writepage(page, wbc);
+	revert_creds(old_cred);
+
 	unlock_page(page);
 
 	/*
@@ -969,5 +958,5 @@ const struct address_space_operations ovl_aops = {
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.writepage	= ovl_writepage,
 	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	.direct_IO	= noop_direct_IO,
+//	.direct_IO	= noop_direct_IO,
 };
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 766505598ed1..d1f04f5526f8 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -148,7 +148,8 @@ int ovl_getattr(const struct path *path, struct kstat *stat,
 	enum ovl_path_type type;
 	struct path realpath;
 	const struct cred *old_cred;
-	bool is_dir = S_ISDIR(dentry->d_inode->i_mode);
+	struct inode *inode = d_inode(dentry);
+	bool is_dir = S_ISDIR(inode->i_mode);
 	bool samefs = ovl_same_sb(dentry->d_sb);
 	struct ovl_layer *lower_layer = NULL;
 	int err;
@@ -162,6 +163,13 @@ int ovl_getattr(const struct path *path, struct kstat *stat,
 	if (err)
 		goto out;
 
+	/*
+	 * With overlay page cache, overlay inode i_size is more uptodate than
+	 * real inode i_size. Perhaps we should generic_fillattr() and only
+	 * update individual stats from real inode?
+	 */
+	stat->size = i_size_read(inode);
+
 	/*
 	 * For non-dir or same fs, we use st_ino of the copy up origin.
 	 * This guaranties constant st_dev/st_ino across copy up.
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index ec237035333a..ab10c75c4b98 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -99,6 +99,7 @@ struct ovl_inode {
 	struct inode vfs_inode;
 	struct dentry *__upperdentry;
 	struct inode *lower;
+	struct file *upper_file;
 
 	/* synchronize copy up and more */
 	struct mutex lock;
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 0116735cc321..0973a7b8abfc 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -9,6 +9,7 @@
 
 #include <uapi/linux/magic.h>
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/namei.h>
 #include <linux/xattr.h>
 #include <linux/mount.h>
@@ -185,6 +186,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
 	oi->__upperdentry = NULL;
 	oi->lower = NULL;
 	oi->lowerdata = NULL;
+	oi->upper_file = NULL;
 	mutex_init(&oi->lock);
 
 	return &oi->vfs_inode;
@@ -201,6 +203,8 @@ static void ovl_destroy_inode(struct inode *inode)
 {
 	struct ovl_inode *oi = OVL_I(inode);
 
+	if (oi->upper_file)
+		fput(oi->upper_file);
 	dput(oi->__upperdentry);
 	iput(oi->lower);
 	if (S_ISDIR(inode->i_mode))
@@ -1427,6 +1431,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	struct cred *cred;
 	int err;
 
+	err = super_setup_bdi(sb);
+	if (err)
+		goto out;
+
 	err = -ENOMEM;
 	ofs = kzalloc(sizeof(struct ovl_fs), GFP_KERNEL);
 	if (!ofs)

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-28 21:22                           ` Miklos Szeredi
@ 2019-01-28 22:14                             ` Amir Goldstein
  2019-01-29  7:17                               ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-28 22:14 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

On Mon, Jan 28, 2019 at 11:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, Jan 28, 2019 at 10:17 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Trying to put some code together...
>
> Very rudimentary patch attached.  It doens't do direct IO yet, but
> demonstrates what I meant about achoring the upper file in the inode.
>

Code looks good, but blows up on xfstests, so I'll wait for a baked patch
before testing.

> Also found the trick to actually make writeback work:
> super_setup_bdi() call in fill_super...
>

Cool, but when adding only setup_bdi() this to my code, I still don't see any
writeback. No writeback observed with your patch as well.

Speaking of which, did you read the warning sign above
simple_write_end()? Aren't you missing the mark_inode_dirty() call?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-28 22:14                             ` Amir Goldstein
@ 2019-01-29  7:17                               ` Miklos Szeredi
  2019-01-29  8:54                                 ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-29  7:17 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Mon, Jan 28, 2019 at 11:14 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Jan 28, 2019 at 11:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, Jan 28, 2019 at 10:17 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > Trying to put some code together...
> >
> > Very rudimentary patch attached.  It doens't do direct IO yet, but
> > demonstrates what I meant about achoring the upper file in the inode.
> >
>
> Code looks good, but blows up on xfstests, so I'll wait for a baked patch
> before testing.
>
> > Also found the trick to actually make writeback work:
> > super_setup_bdi() call in fill_super...
> >
>
> Cool, but when adding only setup_bdi() this to my code, I still don't see any
> writeback. No writeback observed with your patch as well.

I did observe writeback, but haven't done much testing.  Will
hopefully get this into a better shape today.

> Speaking of which, did you read the warning sign above
> simple_write_end()? Aren't you missing the mark_inode_dirty() call?

That's in __set_page_dirty_nobuffers().

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-29  7:17                               ` Miklos Szeredi
@ 2019-01-29  8:54                                 ` Amir Goldstein
  2019-01-29  8:58                                   ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-29  8:54 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

On Tue, Jan 29, 2019 at 9:17 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, Jan 28, 2019 at 11:14 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Jan 28, 2019 at 11:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Mon, Jan 28, 2019 at 10:17 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > > Trying to put some code together...
> > >
> > > Very rudimentary patch attached.  It doens't do direct IO yet, but
> > > demonstrates what I meant about achoring the upper file in the inode.
> > >
> >
> > Code looks good, but blows up on xfstests, so I'll wait for a baked patch
> > before testing.
> >
> > > Also found the trick to actually make writeback work:
> > > super_setup_bdi() call in fill_super...
> > >
> >
> > Cool, but when adding only setup_bdi() this to my code, I still don't see any
> > writeback. No writeback observed with your patch as well.
>
> I did observe writeback, but haven't done much testing.  Will
> hopefully get this into a better shape today.
>

OK, I see the writebacks. Didn't see them with test:

# ./run --ov --samefs --recycle rename-new-pop-dir
...
./run --open-file /mnt/a/dir100-new/a -w -c -W aaaa
[   59.092942] ovl_write_end: page->flags: 10000000000000c
...
/mnt/a/dir100-new/a: File size wrong (got 0, want 4)

Missing sync_filesystem(sb) in ovl_remount()?
No flush yet. What else?

There is also a NULL pointer deference in ovl_fault()
of file opened as upper with vfs_fadvise(NULL, ...
Need to remove vfs_fadvise.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-29  8:54                                 ` Amir Goldstein
@ 2019-01-29  8:58                                   ` Miklos Szeredi
  2019-01-29  9:12                                     ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-29  8:58 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Tue, Jan 29, 2019 at 9:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Jan 29, 2019 at 9:17 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Mon, Jan 28, 2019 at 11:14 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Mon, Jan 28, 2019 at 11:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Mon, Jan 28, 2019 at 10:17 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > > Trying to put some code together...
> > > >
> > > > Very rudimentary patch attached.  It doens't do direct IO yet, but
> > > > demonstrates what I meant about achoring the upper file in the inode.
> > > >
> > >
> > > Code looks good, but blows up on xfstests, so I'll wait for a baked patch
> > > before testing.
> > >
> > > > Also found the trick to actually make writeback work:
> > > > super_setup_bdi() call in fill_super...
> > > >
> > >
> > > Cool, but when adding only setup_bdi() this to my code, I still don't see any
> > > writeback. No writeback observed with your patch as well.
> >
> > I did observe writeback, but haven't done much testing.  Will
> > hopefully get this into a better shape today.
> >
>
> OK, I see the writebacks. Didn't see them with test:
>
> # ./run --ov --samefs --recycle rename-new-pop-dir

Which branch is this?  Master doesn't have --recycle.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-29  8:58                                   ` Miklos Szeredi
@ 2019-01-29  9:12                                     ` Amir Goldstein
  2019-01-29 12:44                                       ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-29  9:12 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

On Tue, Jan 29, 2019 at 10:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, Jan 29, 2019 at 9:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Jan 29, 2019 at 9:17 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Mon, Jan 28, 2019 at 11:14 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 28, 2019 at 11:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > >
> > > > > On Mon, Jan 28, 2019 at 10:17 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > >
> > > > > > Trying to put some code together...
> > > > >
> > > > > Very rudimentary patch attached.  It doens't do direct IO yet, but
> > > > > demonstrates what I meant about achoring the upper file in the inode.
> > > > >
> > > >
> > > > Code looks good, but blows up on xfstests, so I'll wait for a baked patch
> > > > before testing.
> > > >
> > > > > Also found the trick to actually make writeback work:
> > > > > super_setup_bdi() call in fill_super...
> > > > >
> > > >
> > > > Cool, but when adding only setup_bdi() this to my code, I still don't see any
> > > > writeback. No writeback observed with your patch as well.
> > >
> > > I did observe writeback, but haven't done much testing.  Will
> > > hopefully get this into a better shape today.
> > >
> >
> > OK, I see the writebacks. Didn't see them with test:
> >
> > # ./run --ov --samefs --recycle rename-new-pop-dir
>
> Which branch is this?  Master doesn't have --recycle.
>

Sorry, this flag is from branch ovl_snapshot, but exact same
functionality on master with syntax:

./run --ov=0 --samefs rename-new-pop-dir

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-29  9:12                                     ` Amir Goldstein
@ 2019-01-29 12:44                                       ` Miklos Szeredi
  2019-01-29 16:47                                         ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-29 12:44 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

[-- Attachment #1: Type: text/plain, Size: 1891 bytes --]

On Tue, Jan 29, 2019 at 10:13 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Jan 29, 2019 at 10:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Tue, Jan 29, 2019 at 9:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Tue, Jan 29, 2019 at 9:17 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >
> > > > On Mon, Jan 28, 2019 at 11:14 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jan 28, 2019 at 11:22 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > > >
> > > > > > On Mon, Jan 28, 2019 at 10:17 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > > >
> > > > > > > Trying to put some code together...
> > > > > >
> > > > > > Very rudimentary patch attached.  It doens't do direct IO yet, but
> > > > > > demonstrates what I meant about achoring the upper file in the inode.
> > > > > >
> > > > >
> > > > > Code looks good, but blows up on xfstests, so I'll wait for a baked patch
> > > > > before testing.
> > > > >
> > > > > > Also found the trick to actually make writeback work:
> > > > > > super_setup_bdi() call in fill_super...
> > > > > >
> > > > >
> > > > > Cool, but when adding only setup_bdi() this to my code, I still don't see any
> > > > > writeback. No writeback observed with your patch as well.
> > > >
> > > > I did observe writeback, but haven't done much testing.  Will
> > > > hopefully get this into a better shape today.
> > > >
> > >
> > > OK, I see the writebacks. Didn't see them with test:
> > >
> > > # ./run --ov --samefs --recycle rename-new-pop-dir
> >
> > Which branch is this?  Master doesn't have --recycle.
> >
>
> Sorry, this flag is from branch ovl_snapshot, but exact same
> functionality on master with syntax:
>
> ./run --ov=0 --samefs rename-new-pop-dir

Fix attached.

Need to work on other stuff, so xfstests left for tomorrow (or if you
have time, feel free).

Thanks,
Miklos

[-- Attachment #2: d.patch --]
[-- Type: text/x-patch, Size: 399 bytes --]

--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -378,7 +382,6 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.destroy_inode	= ovl_destroy_inode,
-	.drop_inode	= generic_delete_inode,
 	.put_super	= ovl_put_super,
 	.sync_fs	= ovl_sync_fs,
 	.statfs		= ovl_statfs,

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-29 12:44                                       ` Miklos Szeredi
@ 2019-01-29 16:47                                         ` Amir Goldstein
  2019-01-31 16:23                                           ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-29 16:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

> > ./run --ov=0 --samefs rename-new-pop-dir
>
> Fix attached.

So, is sync_filesystem(sb); in ovl_remount() needed or not?

Note that ./run --ov=0 rename-mass-4
does pass but hits a WARN_ON:
[  735.666817] WARNING: CPU: 1 PID: 7 at
/home/amir/build/src/linux/fs/overlayfs/file.c:913
ovl_writepage+0x71/0x113

The WARN_ON is also hit by xfstest overlay/030.

In both cases its an extending write where pos == size

It seems to me that simple_write_end() is indeed too simple
and compared to __generic_write_end() it is missing at least
pagecache_isize_extended() which seems related to the WARN_ON.
I tried several fixes, but gave up.

I found one fix for overlay/001 failure:

@@ -65,7 +65,8 @@ static struct file *ovl_open_upper(const struct file *file,
        ovl_path_upper(file_dentry(file), &realpath);

        old_cred = ovl_override_creds(inode->i_sb);
-       realfile = open_with_fake_path(&realpath, O_RDWR | O_NOATIME,
+       realfile = open_with_fake_path(&realpath,
+                                      O_RDWR | O_LARGEFILE | O_NOATIME,
                                       realinode, current_cred());

Other xfstests failures:
overlay/032 overlay/047 overlay/050 overlay/051

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-29 16:47                                         ` Amir Goldstein
@ 2019-01-31 16:23                                           ` Miklos Szeredi
  2019-01-31 21:54                                             ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-01-31 16:23 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

[-- Attachment #1: Type: text/plain, Size: 509 bytes --]

There's some progress: it makes it through xfstests without crashing
or deadlocking.  There are still some regressing test cases, but
things definitely look better.

What remains is ctime/mtime support.

Also fallocate/copyfile/reflink/etc will need some thought as we need
to not only flush out dirty pages, but in some cases prevent new
faults while the operation is in progress.

And after that it needs to be optimized (->readpages() and
->writepages()), but I think that's the easy part.

Thanks,
Miklos

[-- Attachment #2: ovl-aops-wip-diff.patch --]
[-- Type: text/x-patch, Size: 28615 bytes --]

commit 248c44a527ca1a88c76cac90b1127ae2d6b09b75
Author: Miklos Szeredi <mszeredi@redhat.com>
Date:   Thu Jan 31 10:14:03 2019 +0100

    ovl-page-cache-fix.patch

diff --git a/fs/buffer.c b/fs/buffer.c
index d80b9f821dea..52d024bfdbc1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2117,7 +2117,6 @@ int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
 		mark_inode_dirty(inode);
 	return copied;
 }
-EXPORT_SYMBOL(__generic_write_end);
 
 int block_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index c922682ff1b2..19ff61362a14 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -17,6 +17,7 @@
 #include <linux/fadvise.h>
 #include <linux/writeback.h>
 #include <linux/ratelimit.h>
+#include <linux/workqueue.h>
 #include "overlayfs.h"
 
 static char ovl_whatisit(struct inode *inode, struct inode *realinode)
@@ -35,15 +36,7 @@ static struct file *ovl_open_realfile(const struct file *file,
 	struct inode *inode = file_inode(file);
 	struct file *realfile;
 	const struct cred *old_cred;
-	int flags = file->f_flags | O_NOATIME;
-
-	if (realinode == ovl_inode_upper(inode)) {
-		/* tmpfs has no readpage a_op, so need to read realfile */
-		if ((flags & O_WRONLY) &&
-		    (!realinode->i_mapping ||
-		     !realinode->i_mapping->a_ops->readpage))
-			flags = (flags & ~O_ACCMODE) | O_RDWR;
-	}
+	int flags = (file->f_flags & O_ACCMODE) | O_NOATIME | O_LARGEFILE;
 
 	old_cred = ovl_override_creds(inode->i_sb);
 	realfile = open_with_fake_path(&file->f_path, flags,
@@ -57,6 +50,44 @@ static struct file *ovl_open_realfile(const struct file *file,
 	return realfile;
 }
 
+static struct file *ovl_open_upper(const struct file *file,
+				   struct inode *realinode)
+{
+	struct inode *inode = file_inode(file);
+	struct ovl_inode *oi = OVL_I(inode);
+	struct file *realfile;
+	const struct cred *old_cred;
+	struct path realpath;
+	int flags = O_RDWR | O_NOATIME | O_LARGEFILE;
+
+	realfile = READ_ONCE(oi->upper_file);
+	if (realfile)
+		return realfile;
+
+	ovl_path_upper(file_dentry(file), &realpath);
+
+	old_cred = ovl_override_creds(inode->i_sb);
+	realfile = open_with_fake_path(&realpath, flags,
+				       realinode, current_cred());
+	revert_creds(old_cred);
+
+	pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
+		 file, file, ovl_whatisit(inode, realinode), file->f_flags,
+		 realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
+
+	if (!IS_ERR(realfile)) {
+		struct file *old = cmpxchg(&oi->upper_file, NULL, realfile);
+
+		if (old) {
+			fput(realfile);
+			realfile = old;
+		}
+	}
+
+	return realfile;
+}
+
+#if 0
 #define OVL_SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT)
 
 static int ovl_change_flags(struct file *file, unsigned int flags)
@@ -94,13 +125,14 @@ static int ovl_change_flags(struct file *file, unsigned int flags)
 
 	return 0;
 }
+#endif
 
 static bool ovl_filemap_support(const struct file *file)
 {
-	struct ovl_fs *ofs = file_inode(file)->i_sb->s_fs_info;
+	//struct ovl_fs *ofs = file_inode(file)->i_sb->s_fs_info;
 
 	/* TODO: implement aops to upper inode data */
-	return ofs->upper_mnt && ovl_aops.writepage;
+	return true;
 }
 
 static int ovl_file_maybe_copy_up(const struct file *file, bool allow_meta)
@@ -119,15 +151,17 @@ static int ovl_file_maybe_copy_up(const struct file *file, bool allow_meta)
 	return ovl_maybe_copy_up(file_dentry(file), copy_up_flags);
 }
 
-static int ovl_real_fdget_meta(const struct file *file, struct fd *real,
-			       bool allow_meta)
+static int ovl_real_fdget(const struct file *file, struct fd *real)
 {
 	struct inode *inode = file_inode(file);
+	struct ovl_inode *oi = OVL_I(inode);
 	struct inode *realinode;
 	int err;
 
 	real->flags = 0;
-	real->file = file->private_data;
+	real->file = READ_ONCE(oi->upper_file);
+	if (!real->file)
+		real->file = file->private_data;
 
 	/*
 	 * Lazy copy up caches the meta copy upper file on open O_RDWR.
@@ -136,33 +170,41 @@ static int ovl_real_fdget_meta(const struct file *file, struct fd *real,
 	 * we may try to open a lower file O_RDWR or perform data operations
 	 * (e.g. fallocate) on the metacopy inode.
 	 */
-	err = ovl_file_maybe_copy_up(file, allow_meta);
+	err = ovl_file_maybe_copy_up(file, false);
 	if (err)
 		return err;
 
-	if (allow_meta)
-		realinode = ovl_inode_real(inode);
-	else
-		realinode = ovl_inode_realdata(inode);
+	realinode = ovl_inode_realdata(inode);
 
 	/* Has it been copied up since we'd opened it? */
 	if (unlikely(file_inode(real->file) != realinode)) {
-		real->flags = FDPUT_FPUT;
-		real->file = ovl_open_realfile(file, realinode);
+		real->file = ovl_open_upper(file, realinode);
 
 		return PTR_ERR_OR_ZERO(real->file);
 	}
 
+#if 0
 	/* Did the flags change since open? */
 	if (unlikely((file->f_flags ^ real->file->f_flags) & ~O_NOATIME))
 		return ovl_change_flags(real->file, file->f_flags);
+#endif
 
 	return 0;
 }
 
-static int ovl_real_fdget(const struct file *file, struct fd *real)
+static int ovl_real_fdget_meta(const struct file *file, struct fd *real,
+			       bool allow_meta)
 {
-	return ovl_real_fdget_meta(file, real, false);
+	struct inode *inode = file_inode(file);
+	struct inode *upperinode = ovl_inode_upper(inode);
+
+	if (!allow_meta || !upperinode || ovl_has_upperdata(inode))
+		return ovl_real_fdget(file, real);
+
+	real->flags = FDPUT_FPUT;
+	real->file = ovl_open_realfile(file, upperinode);
+
+	return PTR_ERR_OR_ZERO(real->file);
 }
 
 static bool ovl_should_use_filemap_meta(struct file *file, bool allow_meta)
@@ -220,6 +262,7 @@ static int ovl_flush_filemap(struct file *file, loff_t offset, loff_t len)
 static int ovl_open(struct inode *inode, struct file *file)
 {
 	struct file *realfile;
+	struct inode *realinode;
 	int err;
 	bool allow_meta = (file->f_mode & FMODE_WRITE) &&
 			ovl_filemap_support(file);
@@ -228,22 +271,28 @@ static int ovl_open(struct inode *inode, struct file *file)
 	if (err)
 		return err;
 
-	/* No longer need these flags, so don't pass them on to underlying fs */
-	file->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
+	realinode = allow_meta ?
+		ovl_inode_real(inode) : ovl_inode_realdata(inode);
 
-	realfile = ovl_open_realfile(file, allow_meta ? ovl_inode_real(inode) :
-				     ovl_inode_realdata(inode));
-	if (IS_ERR(realfile))
-		return PTR_ERR(realfile);
+	if (realinode == ovl_inode_upper(inode)) {
+		realfile = ovl_open_upper(file, realinode);
+		if (IS_ERR(realfile))
+			return PTR_ERR(realfile);
+	} else {
+		realfile = ovl_open_realfile(file, realinode);
+		if (IS_ERR(realfile))
+			return PTR_ERR(realfile);
 
-	file->private_data = realfile;
+		file->private_data = realfile;
+	}
 
 	return 0;
 }
 
 static int ovl_release(struct inode *inode, struct file *file)
 {
-	fput(file->private_data);
+	if (file->private_data)
+		fput(file->private_data);
 
 	return 0;
 }
@@ -254,7 +303,7 @@ static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
 
 	return generic_file_llseek_size(file, offset, whence,
 					realinode->i_sb->s_maxbytes,
-					i_size_read(realinode));
+					i_size_read(file_inode(file)));
 }
 
 static void ovl_file_accessed(struct file *file)
@@ -292,6 +341,10 @@ static rwf_t ovl_iocb_to_rwf(struct kiocb *iocb)
 		flags |= RWF_DSYNC;
 	if (ifl & IOCB_SYNC)
 		flags |= RWF_SYNC;
+	if (ifl & IOCB_APPEND)
+		flags |= RWF_APPEND;
+	if (ifl & IOCB_DIRECT)
+		flags |= RWF_DIRECT;
 
 	return flags;
 }
@@ -362,7 +415,7 @@ static ssize_t ovl_real_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	revert_creds(old_cred);
 
 	/* Update size */
-	ovl_copyattr(ovl_inode_real(inode), inode);
+	ovl_copyattr_size(ovl_inode_real(inode), inode);
 
 	fdput(real);
 
@@ -375,9 +428,14 @@ static ssize_t ovl_real_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+
+	if (ovl_should_use_filemap(file)) {
+		/* Update mode for file_remove_privs() */
+		ovl_copyattr(ovl_inode_real(inode), inode);
 
-	if (ovl_should_use_filemap(file))
 		return generic_file_write_iter(iocb, iter);
+	}
 
 	return ovl_real_write_iter(iocb, iter);
 }
@@ -387,6 +445,7 @@ static int ovl_real_fsync(struct file *file, loff_t start, loff_t end,
 			  int datasync)
 {
 	struct fd real;
+	struct inode *inode = file_inode(file);
 	const struct cred *old_cred;
 	int ret;
 
@@ -395,12 +454,11 @@ static int ovl_real_fsync(struct file *file, loff_t start, loff_t end,
 		return ret;
 
 	/* Don't sync lower file for fear of receiving EROFS error */
-	if (file_inode(real.file) == ovl_inode_upper(file_inode(file))) {
-		old_cred = ovl_override_creds(file_inode(file)->i_sb);
+	if (file_inode(real.file) == ovl_inode_upper(inode)) {
+		old_cred = ovl_override_creds(inode->i_sb);
 		ret = vfs_fsync_range(real.file, start, end, datasync);
 		revert_creds(old_cred);
 	}
-
 	fdput(real);
 
 	return ret;
@@ -408,8 +466,13 @@ static int ovl_real_fsync(struct file *file, loff_t start, loff_t end,
 
 static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
-	if (ovl_should_use_filemap_meta(file, true))
-		return __generic_file_fsync(file, start, end, datasync);
+	int err;
+
+	if (ovl_should_use_filemap_meta(file, true)) {
+		err = file_write_and_wait_range(file, start, end);
+		if (err)
+			return err;
+	}
 
 	return ovl_real_fsync(file, start, end, datasync);
 }
@@ -417,7 +480,6 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 static vm_fault_t ovl_fault(struct vm_fault *vmf)
 {
 	struct file *file = vmf->vma->vm_file;
-	struct file *realfile = file->private_data;
 	struct inode *inode = file_inode(file);
 	bool blocking = (vmf->flags & FAULT_FLAG_KILLABLE) ||
 			((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
@@ -443,13 +505,7 @@ static vm_fault_t ovl_fault(struct vm_fault *vmf)
 			goto out_err;
 
 		return ret;
-	} else {
-		err = vfs_fadvise(realfile, vmf->pgoff << PAGE_SHIFT,
-				  file->f_ra.ra_pages, POSIX_FADV_WILLNEED);
-		if (err)
-			goto out_err;
 	}
-
 	return filemap_fault(vmf);
 
 out_err:
@@ -526,23 +582,30 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset,
 	const struct cred *old_cred;
 	int ret;
 
+	inode_lock(inode);
+
 	/* XXX: Different modes need to flush different ranges... */
 	ret = ovl_flush_filemap(file, 0, LLONG_MAX);
 	if (ret)
-		return ret;
+		goto unlock;
 
 	ret = ovl_real_fdget(file, &real);
 	if (ret)
-		return ret;
+		goto unlock;
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
 	ret = vfs_fallocate(real.file, mode, offset, len);
 	revert_creds(old_cred);
 
 	/* Update size */
-	ovl_copyattr(ovl_inode_real(inode), inode);
+	ovl_copyattr_size(ovl_inode_real(inode), inode);
+
+	/* and invalidate mappings and page cache */
+	truncate_pagecache_range(inode, 0, LLONG_MAX);
 
 	fdput(real);
+unlock:
+	inode_unlock(inode);
 
 	return ret;
 }
@@ -670,24 +733,23 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 	const struct cred *old_cred;
 	loff_t ret;
 
-	/* XXX: For some ops zero length means EOF... */
-	ret = ovl_flush_filemap(file_out, pos_out, len ?: LLONG_MAX);
+	inode_lock(inode_out);
+
+	ret = ovl_flush_filemap(file_out, pos_out, LLONG_MAX);
 	if (ret)
-		return ret;
+		goto unlock;
 
-	ret = ovl_flush_filemap(file_in, pos_in, len ?: LLONG_MAX);
+	ret = ovl_flush_filemap(file_in, pos_in, LLONG_MAX);
 	if (ret)
-		return ret;
+		goto unlock;
 
 	ret = ovl_real_fdget(file_out, &real_out);
 	if (ret)
-		return ret;
+		goto unlock;
 
 	ret = ovl_real_fdget(file_in, &real_in);
-	if (ret) {
-		fdput(real_out);
-		return ret;
-	}
+	if (ret)
+		goto fdput_out;
 
 	old_cred = ovl_override_creds(file_inode(file_out)->i_sb);
 	switch (op) {
@@ -710,10 +772,18 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 	revert_creds(old_cred);
 
 	/* Update size */
-	ovl_copyattr(ovl_inode_real(inode_out), inode_out);
+	ovl_copyattr_size(ovl_inode_real(inode_out), inode_out);
+
+	if (op != OVL_DEDUPE)
+		truncate_pagecache_range(inode_out,
+					 round_down(pos_out, PAGE_SIZE),
+					 LLONG_MAX);
 
 	fdput(real_in);
+fdput_out:
 	fdput(real_out);
+unlock:
+	inode_unlock(inode_out);
 
 	return ret;
 }
@@ -770,39 +840,6 @@ const struct file_operations ovl_file_operations = {
 	.remap_file_range	= ovl_remap_file_range,
 };
 
-static struct page *ovl_real_get_page(struct file *realfile, pgoff_t index)
-{
-	struct page *page;
-
-	page = read_mapping_page(file_inode(realfile)->i_mapping, index, NULL);
-	if (IS_ERR(page))
-		return page;
-
-	if (!PageUptodate(page)) {
-		put_page(page);
-		return ERR_PTR(-EIO);
-	}
-
-	lock_page(page);
-
-	return page;
-}
-
-static int ovl_real_copy_page(struct file *realfile, struct page *page)
-{
-	struct page *realpage;
-
-	realpage = ovl_real_get_page(realfile, page->index);
-	if (IS_ERR(realpage))
-		return PTR_ERR(realpage);
-
-	copy_highpage(page, realpage);
-	unlock_page(realpage);
-	put_page(realpage);
-
-	return 0;
-}
-
 static int ovl_real_readpage(struct file *realfile, struct page *page)
 {
 	struct bio_vec bvec = {
@@ -817,22 +854,25 @@ static int ovl_real_readpage(struct file *realfile, struct page *page)
 	iov_iter_bvec(&iter, READ, &bvec, 1, PAGE_SIZE);
 
 	ret = vfs_iter_read(realfile, &iter, &pos, 0);
+	if (ret >= 0 && ret < PAGE_SIZE)
+		zero_user_segment(page, ret, PAGE_SIZE);
 
 	return ret < 0 ? ret : 0;
 }
 
 static int ovl_do_readpage(struct file *file, struct page *page)
 {
-	struct file *realfile = file->private_data;
+	struct inode *inode = file_inode(file);
+	struct ovl_inode *oi = OVL_I(inode);
+	struct file *realfile = READ_ONCE(oi->upper_file);
 	const struct cred *old_cred;
 	int ret;
 
-	/* tmpfs has no readpage a_op, so need to read with f_op */
+	if (!realfile)
+		realfile = file->private_data;
+
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
-	if (!realfile->f_mapping || !realfile->f_mapping->a_ops->readpage)
-		ret = ovl_real_readpage(realfile, page);
-	else
-		ret = ovl_real_copy_page(realfile, page);
+	ret = ovl_real_readpage(realfile, page);
 	revert_creds(old_cred);
 
 	if (!ret)
@@ -841,14 +881,37 @@ static int ovl_do_readpage(struct file *file, struct page *page)
 	return 0;
 }
 
+struct ovl_readpage_work {
+	struct file *file;
+	struct page *page;
+	struct work_struct work;
+};
+
+static void ovl_readpage_work_fn(struct work_struct *work)
+{
+	struct ovl_readpage_work *ow;
+
+	ow = container_of(work, struct ovl_readpage_work, work);
+
+	ovl_do_readpage(ow->file, ow->page);
+	unlock_page(ow->page);
+}
+
 static int ovl_readpage(struct file *file, struct page *page)
 {
-	int ret;
+	struct ovl_readpage_work *ow;
 
-	ret = ovl_do_readpage(file, page);
-	unlock_page(page);
+	ow = kmalloc(sizeof(*ow), GFP_KERNEL);
+	if (!ow) {
+		unlock_page(page);
+		return -ENOMEM;
+	}
+	ow->file = file;
+	ow->page = page;
+	INIT_WORK(&ow->work, ovl_readpage_work_fn);
+	queue_work(ovl_wq, &ow->work);
 
-	return ret;
+	return 0;
 }
 
 static int ovl_write_begin(struct file *file, struct address_space *mapping,
@@ -881,76 +944,74 @@ static int ovl_write_begin(struct file *file, struct address_space *mapping,
 	return 0;
 }
 
-static int ovl_real_write_end(struct file *file, loff_t pos,
-			      unsigned int copied, struct page *page)
-{
-	struct file *realfile = file->private_data;
-	unsigned int offset = (pos & (PAGE_SIZE - 1));
-	struct bio_vec bvec = {
-		.bv_page = page,
-		.bv_len = copied,
-		.bv_offset = offset,
-	};
-	struct iov_iter iter;
-	const struct cred *old_cred;
-	ssize_t ret;
-
-	iov_iter_bvec(&iter, WRITE, &bvec, 1, copied);
-
-	old_cred = ovl_override_creds(file_inode(file)->i_sb);
-	ret = vfs_iter_write(realfile, &iter, &pos, 0);
-	revert_creds(old_cred);
-
-	return ret < 0 ? ret : 0;
-}
-
-extern int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
-			       struct page *page);
-
 static int ovl_write_end(struct file *file, struct address_space *mapping,
 			 loff_t pos, unsigned len, unsigned copied,
 			 struct page *page, void *fsdata)
 {
-	int err;
+	int res;
 
-	err = ovl_real_write_end(file, pos, copied, page);
-	if (err) {
-		pr_warn("ovl_write_end: %i", err);
-		unlock_page(page);
-		put_page(page);
+	res = simple_write_end(file, mapping, pos, len, copied, page, fsdata);
 
-		return -EIO;
-	}
+	//pr_info("ovl_write_end: page->flags: %lx\n", page->flags);
 
-	return __generic_write_end(file_inode(file), pos, copied, page);
+	return res;
 }
 
 static int ovl_real_writepage(struct page *page, struct writeback_control *wbc)
 {
-	struct inode *realinode = ovl_inode_real(page->mapping->host);
-	struct page *realpage;
-	int ret;
+	struct ovl_inode *oi = OVL_I(page->mapping->host);
+	struct file *realfile = oi->upper_file;
+	loff_t pos = page->index << PAGE_SHIFT;
+	loff_t size = i_size_read(page->mapping->host);
+	size_t len = PAGE_SIZE;
+	struct bio_vec bvec = {
+		.bv_page = page,
+		.bv_len = PAGE_SIZE,
+		.bv_offset = 0,
+	};
+	struct iov_iter iter;
+	ssize_t ret;
+	rwf_t flags = 0;
 
-	if (!realinode->i_mapping || !realinode->i_mapping->a_ops->writepage)
-		return -EIO;
+	if (size <= pos) {
+		pr_info("ovl_writepage: size = %lli pos = %lli\n", size, pos);
+		return 0;
+	}
+	
+	if (size < pos + PAGE_SIZE) {
+		/* Can't do direct I/O for tail pages */
+		len = size - pos;
+	} else if (realfile->f_mapping->a_ops &&
+		   realfile->f_mapping->a_ops->direct_IO) {
+		flags |= RWF_DIRECT;
+	}
 
-	realpage = grab_cache_page(realinode->i_mapping, page->index);
-	copy_highpage(realpage, page);
-	set_page_dirty(realpage);
+	//pr_info("ovl_real_writepage(%lli)\n", pos);
+	iov_iter_bvec(&iter, WRITE, &bvec, 1, len);
 
-	/* Start writeback on and unlock real page */
-	ret = realinode->i_mapping->a_ops->writepage(realpage, wbc);
-	put_page(realpage);
+	ret = vfs_iter_write(realfile, &iter, &pos, flags);
+	if (ret != len)
+		pr_warn("ovl_read_writepage: write returned %zi (not %zi)\n",
+			ret, len);
 
-	return ret;
+	/* FADV_DONTNEED for tail pages*/
+
+	//pr_info("ovl_real_writepage(%lli) = %li\n", pos, ret);
+
+	return ret < 0 ? ret : 0;
 }
 
 static int ovl_writepage(struct page *page, struct writeback_control *wbc)
 {
 	int ret;
+	const struct cred *old_cred;
 
 	set_page_writeback(page);
+
+	old_cred = ovl_override_creds(page->mapping->host->i_sb);
 	ret = ovl_real_writepage(page, wbc);
+	revert_creds(old_cred);
+
 	unlock_page(page);
 
 	/*
@@ -962,6 +1023,36 @@ static int ovl_writepage(struct page *page, struct writeback_control *wbc)
 	return ret;
 }
 
+static ssize_t ovl_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	rwf_t flags = ovl_iocb_to_rwf(iocb);
+	const struct cred *old_cred;
+	struct fd real;
+	ssize_t ret;
+
+	ret = ovl_real_fdget(file, &real);
+	if (ret)
+		goto out;
+
+	old_cred = ovl_override_creds(file_inode(file)->i_sb);
+	if (iov_iter_rw(iter) == READ) {
+		return vfs_iter_read(real.file,iter, &iocb->ki_pos, flags);
+	} else {
+		file_start_write(real.file);
+		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos, flags);
+		file_end_write(real.file);
+
+		/* Update size */
+		ovl_copyattr_size(ovl_inode_real(inode), inode);
+	}
+	revert_creds(old_cred);
+	fdput(real);
+out:
+	return ret;
+}
+
 const struct address_space_operations ovl_aops = {
 	.readpage	= ovl_readpage,
 	.write_begin	= ovl_write_begin,
@@ -969,5 +1060,5 @@ const struct address_space_operations ovl_aops = {
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.writepage	= ovl_writepage,
 	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	.direct_IO	= noop_direct_IO,
+	.direct_IO	= ovl_direct_IO,
 };
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 766505598ed1..e139f92b64ba 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -8,6 +8,7 @@
  */
 
 #include <linux/fs.h>
+#include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/cred.h>
 #include <linux/xattr.h>
@@ -18,10 +19,12 @@
 
 int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 {
+	struct inode *upperinode, *inode = d_inode(dentry);
 	int err;
 	bool full_copy_up = false;
 	struct dentry *upperdentry;
 	const struct cred *old_cred;
+	bool is_truncate = attr->ia_valid & ATTR_SIZE;
 
 	err = setattr_prepare(dentry, attr);
 	if (err)
@@ -31,47 +34,57 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 	if (err)
 		goto out;
 
-	if (attr->ia_valid & ATTR_SIZE) {
+	if (is_truncate) {
 		struct inode *realinode = d_inode(ovl_dentry_real(dentry));
 
 		err = -ETXTBSY;
 		if (atomic_read(&realinode->i_writecount) < 0)
 			goto out_drop_write;
 
+		err = filemap_write_and_wait_range(inode->i_mapping,
+						   0, LLONG_MAX);
+		if (err)
+			goto out_drop_write;
+
+
 		/* Truncate should trigger data copy up as well */
 		full_copy_up = true;
+
 	}
 
 	if (!full_copy_up)
 		err = ovl_copy_up(dentry);
 	else
 		err = ovl_copy_up_with_data(dentry);
-	if (!err) {
-		struct inode *winode = NULL;
 
-		upperdentry = ovl_dentry_upper(dentry);
+	if (err)
+		goto out_drop_write;
 
-		if (attr->ia_valid & ATTR_SIZE) {
-			winode = d_inode(upperdentry);
-			err = get_write_access(winode);
-			if (err)
-				goto out_drop_write;
-		}
+	upperdentry = ovl_dentry_upper(dentry);
+	upperinode = d_inode(upperdentry);
 
-		if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID))
-			attr->ia_valid &= ~ATTR_MODE;
+	if (is_truncate) {
+		err = get_write_access(upperinode);
+		if (err)
+			goto out_drop_write;
+	}
 
-		inode_lock(upperdentry->d_inode);
-		old_cred = ovl_override_creds(dentry->d_sb);
-		err = notify_change(upperdentry, attr, NULL);
-		revert_creds(old_cred);
-		if (!err)
-			ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
-		inode_unlock(upperdentry->d_inode);
+	if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID))
+		attr->ia_valid &= ~ATTR_MODE;
 
-		if (winode)
-			put_write_access(winode);
+	inode_lock(upperinode);
+	old_cred = ovl_override_creds(dentry->d_sb);
+	err = notify_change(upperdentry, attr, NULL);
+	revert_creds(old_cred);
+	if (!err) {
+		if (is_truncate)
+			truncate_setsize(inode, i_size_read(upperinode));
+		ovl_copyattr(upperinode, inode);
 	}
+	inode_unlock(upperinode);
+
+	if (is_truncate)
+		put_write_access(upperinode);
 out_drop_write:
 	ovl_drop_write(dentry);
 out:
@@ -148,7 +161,8 @@ int ovl_getattr(const struct path *path, struct kstat *stat,
 	enum ovl_path_type type;
 	struct path realpath;
 	const struct cred *old_cred;
-	bool is_dir = S_ISDIR(dentry->d_inode->i_mode);
+	struct inode *inode = d_inode(dentry);
+	bool is_dir = S_ISDIR(inode->i_mode);
 	bool samefs = ovl_same_sb(dentry->d_sb);
 	struct ovl_layer *lower_layer = NULL;
 	int err;
@@ -162,6 +176,13 @@ int ovl_getattr(const struct path *path, struct kstat *stat,
 	if (err)
 		goto out;
 
+	/*
+	 * With overlay page cache, overlay inode i_size is more uptodate than
+	 * real inode i_size. Perhaps we should generic_fillattr() and only
+	 * update individual stats from real inode?
+	 */
+	stat->size = i_size_read(inode);
+
 	/*
 	 * For non-dir or same fs, we use st_ino of the copy up origin.
 	 * This guaranties constant st_dev/st_ino across copy up.
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index b5155ed6d516..576488162257 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -12,6 +12,9 @@
 #include <linux/fs.h>
 #include "ovl_entry.h"
 
+struct workqueue_struct;
+extern struct workqueue_struct *ovl_wq;
+
 enum ovl_path_type {
 	__OVL_PATH_UPPER	= (1 << 0),
 	__OVL_PATH_MERGE	= (1 << 1),
@@ -376,6 +379,11 @@ static inline void ovl_copyattr(struct inode *from, struct inode *to)
 	to->i_atime = from->i_atime;
 	to->i_mtime = from->i_mtime;
 	to->i_ctime = from->i_ctime;
+}
+
+static inline void ovl_copyattr_size(struct inode *from, struct inode *to)
+{
+	ovl_copyattr(from, to);
 	i_size_write(to, i_size_read(from));
 }
 
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index ec237035333a..ab10c75c4b98 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -99,6 +99,7 @@ struct ovl_inode {
 	struct inode vfs_inode;
 	struct dentry *__upperdentry;
 	struct inode *lower;
+	struct file *upper_file;
 
 	/* synchronize copy up and more */
 	struct mutex lock;
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 0116735cc321..cfbf1d8994c0 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -9,6 +9,7 @@
 
 #include <uapi/linux/magic.h>
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/namei.h>
 #include <linux/xattr.h>
 #include <linux/mount.h>
@@ -16,6 +17,7 @@
 #include <linux/module.h>
 #include <linux/statfs.h>
 #include <linux/seq_file.h>
+#include <linux/workqueue.h>
 #include <linux/posix_acl_xattr.h>
 #include <linux/exportfs.h>
 #include "overlayfs.h"
@@ -27,6 +29,8 @@ MODULE_LICENSE("GPL");
 
 struct ovl_dir_cache;
 
+struct workqueue_struct *ovl_wq;
+
 #define OVL_MAX_STACK 500
 
 static bool ovl_redirect_dir_def = IS_ENABLED(CONFIG_OVERLAY_FS_REDIRECT_DIR);
@@ -185,6 +189,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
 	oi->__upperdentry = NULL;
 	oi->lower = NULL;
 	oi->lowerdata = NULL;
+	oi->upper_file = NULL;
 	mutex_init(&oi->lock);
 
 	return &oi->vfs_inode;
@@ -201,6 +206,8 @@ static void ovl_destroy_inode(struct inode *inode)
 {
 	struct ovl_inode *oi = OVL_I(inode);
 
+	if (oi->upper_file)
+		fput(oi->upper_file);
 	dput(oi->__upperdentry);
 	iput(oi->lower);
 	if (S_ISDIR(inode->i_mode))
@@ -378,7 +385,6 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.destroy_inode	= ovl_destroy_inode,
-	.drop_inode	= generic_delete_inode,
 	.put_super	= ovl_put_super,
 	.sync_fs	= ovl_sync_fs,
 	.statfs		= ovl_statfs,
@@ -1427,6 +1433,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	struct cred *cred;
 	int err;
 
+	err = super_setup_bdi(sb);
+	if (err)
+		goto out;
+
 	err = -ENOMEM;
 	ofs = kzalloc(sizeof(struct ovl_fs), GFP_KERNEL);
 	if (!ofs)
@@ -1585,7 +1595,11 @@ static void ovl_inode_init_once(void *foo)
 
 static int __init ovl_init(void)
 {
-	int err;
+	int err = -ENOMEM;
+
+	ovl_wq = alloc_workqueue("ovl_writeback", 0, 0);
+	if (!ovl_wq)
+		goto out;
 
 	ovl_inode_cachep = kmem_cache_create("ovl_inode",
 					     sizeof(struct ovl_inode), 0,
@@ -1593,12 +1607,19 @@ static int __init ovl_init(void)
 					      SLAB_MEM_SPREAD|SLAB_ACCOUNT),
 					     ovl_inode_init_once);
 	if (ovl_inode_cachep == NULL)
-		return -ENOMEM;
+		goto out_destroy_wq;
 
 	err = register_filesystem(&ovl_fs_type);
 	if (err)
-		kmem_cache_destroy(ovl_inode_cachep);
+		goto out_cache_destroy;
+
+	return 0;
 
+out_cache_destroy:
+	kmem_cache_destroy(ovl_inode_cachep);
+out_destroy_wq:
+	destroy_workqueue(ovl_wq);
+out:
 	return err;
 }
 
@@ -1613,6 +1634,8 @@ static void __exit ovl_exit(void)
 	rcu_barrier();
 	kmem_cache_destroy(ovl_inode_cachep);
 
+	destroy_workqueue(ovl_wq);
+
 }
 
 module_init(ovl_init);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 54ed62a1d9d6..b1989c479947 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -413,7 +413,7 @@ void ovl_inode_init(struct inode *inode, struct dentry *upperdentry,
 	if (lowerdata)
 		OVL_I(inode)->lowerdata = igrab(d_inode(lowerdata));
 
-	ovl_copyattr(realinode, inode);
+	ovl_copyattr_size(realinode, inode);
 	ovl_copyflags(realinode, inode);
 	if (!inode->i_ino)
 		inode->i_ino = realinode->i_ino;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 811c77743dad..2901ffc43c11 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3339,6 +3339,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
 	if (flags & RWF_APPEND)
 		ki->ki_flags |= IOCB_APPEND;
+	if (flags & RWF_DIRECT)
+		ki->ki_flags |= IOCB_DIRECT;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 121e82ce296b..b6e821ddb066 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -342,8 +342,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* per-IO O_DIRECT */
+#define RWF_DIRECT	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_DIRECT)
 
 #endif /* _UAPI_LINUX_FS_H */

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-31 16:23                                           ` Miklos Szeredi
@ 2019-01-31 21:54                                             ` Amir Goldstein
  2019-02-01  9:14                                               ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-01-31 21:54 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

[-- Attachment #1: Type: text/plain, Size: 1070 bytes --]

On Thu, Jan 31, 2019 at 6:24 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> There's some progress: it makes it through xfstests without crashing
> or deadlocking.  There are still some regressing test cases, but
> things definitely look better.
>

They do!

kmemleak detected leaking ovl_readpage_work in ovl_readpage_work_fn()
and creator_cred in ovl_direct_IO().
I pushed your patch with the leak fixes to  ovl-aops-wip as a testing point.

> What remains is ctime/mtime support.
>
> Also fallocate/copyfile/reflink/etc will need some thought as we need
> to not only flush out dirty pages, but in some cases prevent new
> faults while the operation is in progress.

Well, generic/503 which exercises these corners hangs
(503.dmesg attached)

>
> And after that it needs to be optimized (->readpages() and
> ->writepages()), but I think that's the easy part.
>

And we also need to flush pages on non direct IO cases
of ->writepages() and then we can revert:
e8d4bfe3a715 ovl: Sync upper dirty data when syncing overlayfs
and swoop the grand prize :)

Thanks,
Amir.

[-- Attachment #2: 503.demg --]
[-- Type: application/octet-stream, Size: 4712 bytes --]

[ 2211.624509] INFO: task kworker/0:1:27923 blocked for more than 120 seconds.
[ 2211.629034]       Not tainted 5.0.0-rc1-xfstests-00012-g60d58209fbdd-dirty #4105
[ 2211.633470] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2211.637586] kworker/0:1     D    0 27923      2 0x80000000
[ 2211.638720] Workqueue: ovl_writeback ovl_readpage_work_fn
[ 2211.639654] Call Trace:
[ 2211.640095]  ? __schedule+0x939/0x9ae
[ 2211.640734]  ? rwsem_down_read_failed+0x121/0x196
[ 2211.641681]  schedule+0x6e/0x86
[ 2211.642462]  rwsem_down_read_failed+0x155/0x196
[ 2211.643533]  ? xfs_file_buffered_aio_read+0x10c/0x135
[ 2211.644704]  ? call_rwsem_down_read_failed+0x14/0x30
[ 2211.645541]  call_rwsem_down_read_failed+0x14/0x30
[ 2211.646475]  down_read_nested+0x79/0x9c
[ 2211.647808]  xfs_ilock+0x18b/0x208
[ 2211.648638]  xfs_file_buffered_aio_read+0x10c/0x135
[ 2211.649649]  xfs_file_read_iter+0x81/0xc8
[ 2211.650417]  do_iter_readv_writev+0x15a/0x191
[ 2211.651230]  do_iter_read+0x87/0xfb
[ 2211.651709]  ovl_do_readpage+0x8b/0x139
[ 2211.652235]  ovl_readpage_work_fn+0x16/0x29
[ 2211.652820]  process_one_work+0x2d3/0x4f5
[ 2211.653393]  ? worker_thread+0x23e/0x2b6
[ 2211.653947]  worker_thread+0x1e2/0x2b6
[ 2211.654685]  ? process_scheduled_works+0x2c/0x2c
[ 2211.655641]  kthread+0x124/0x12c
[ 2211.656331]  ? __kthread_create_on_node+0x196/0x196
[ 2211.657354]  ret_from_fork+0x3a/0x50
[ 2211.658113] INFO: task t_mmap_collisio:8210 blocked for more than 120 seconds.
[ 2211.659601]       Not tainted 5.0.0-rc1-xfstests-00012-g60d58209fbdd-dirty #4105
[ 2211.661051] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2211.662468] t_mmap_collisio D    0  8210   7956 0x00000000
[ 2211.663412] Call Trace:
[ 2211.663850]  ? __schedule+0x939/0x9ae
[ 2211.664528]  schedule+0x6e/0x86
[ 2211.665197]  io_schedule+0x12/0x33
[ 2211.665902]  __lock_page+0x186/0x20f
[ 2211.666539]  ? __add_to_page_cache_locked+0x264/0x264
[ 2211.667422]  truncate_inode_pages_range+0x51e/0x5d1
[ 2211.668311]  ? tlb_flush_mmu_free+0x15/0x42
[ 2211.669190]  ? arch_tlb_finish_mmu+0x8f/0xb2
[ 2211.670003]  ? tlb_finish_mmu+0x1e/0x2a
[ 2211.670752]  ? zap_page_range_single+0xbf/0xe3
[ 2211.671622]  ? kvm_clock_read+0x14/0x1c
[ 2211.672468]  ? kvm_sched_clock_read+0x5/0xd
[ 2211.673211]  ? uncore_event_cpu_offline+0x81/0x117
[ 2211.673880]  ? sched_clock_cpu+0x10/0xad
[ 2211.674641]  ? up_write+0x1c/0x79
[ 2211.675415]  ? unmap_mapping_pages+0x63/0x110
[ 2211.676325]  truncate_pagecache+0x3b/0x51
[ 2211.677162]  ovl_setattr+0x18c/0x1f9
[ 2211.677946]  notify_change+0x273/0x353
[ 2211.678742]  ? ovl_other_xattr_set+0x1e/0x1e
[ 2211.679628]  do_truncate+0x81/0xb4
[ 2211.680349]  ? rcu_read_lock_sched_held+0x5d/0x63
[ 2211.681361]  do_sys_ftruncate+0xcb/0x14b
[ 2211.682128]  do_syscall_64+0x57/0x14e
[ 2211.682869]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2211.683884] RIP: 0033:0x7fe1ebf7cc87
[ 2211.684773] Code: Bad RIP value.
[ 2211.685534] RSP: 002b:00007fe1eb697f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
[ 2211.687288] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe1ebf7cc87
[ 2211.688724] RDX: 9837804094cc53b7 RSI: 0000000000000000 RDI: 0000000000000003
[ 2211.689734] RBP: 0000000010624dd3 R08: 00007fe1eb698700 R09: 00007fe1eb698700
[ 2211.690985] R10: 00007fe1eb6989d0 R11: 0000000000000246 R12: 0000000000000000
[ 2211.692491] R13: 00007ffdf0ead29f R14: 00007fe1eb698700 R15: 000055ff026a2240
[ 2211.693741] 
[ 2211.693741] Showing all locks held in the system:
[ 2211.694870] 1 lock held by khungtaskd/261:
[ 2211.695728]  #0: 0000000058dadfef (rcu_read_lock){....}, at: debug_show_all_locks+0x14/0x17b
[ 2211.697195] 3 locks held by kworker/0:1/27923:
[ 2211.697995]  #0: 00000000c1a41107 ((wq_completion)"ovl_writeback"){+.+.}, at: process_one_work+0x1af/0x4f5
[ 2211.699789]  #1: 000000009a0e6fe7 ((work_completion)(&ow->work)){+.+.}, at: process_one_work+0x1af/0x4f5
[ 2211.701457]  #2: 00000000f4e30bc0 (&sb->s_type->i_mutex_key#13){++++}, at: xfs_ilock+0x18b/0x208
[ 2211.702909] 1 lock held by t_mmap_collisio/8209:
[ 2211.703656]  #0: 0000000078d2625e (&inode->i_rwsem){++++}, at: xfs_ilock+0x18b/0x208
[ 2211.705398] 4 locks held by t_mmap_collisio/8210:
[ 2211.706454]  #0: 000000006b6c5f2a (sb_writers#11){.+.+}, at: do_sys_ftruncate+0xae/0x14b
[ 2211.708008]  #1: 000000005e6e3d55 (&ovl_i_mutex_key[depth]){+.+.}, at: do_truncate+0x72/0xb4
[ 2211.709460]  #2: 0000000070704028 (sb_writers#9){.+.+}, at: mnt_want_write+0x1d/0x42
[ 2211.710791]  #3: 00000000f4e30bc0 (&sb->s_type->i_mutex_key#13){++++}, at: ovl_setattr+0xfe/0x1f9
[ 2211.712530] 
[ 2211.712851] =============================================
[ 2211.712851] 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-01-31 21:54                                             ` Amir Goldstein
@ 2019-02-01  9:14                                               ` Miklos Szeredi
  2019-02-01 13:22                                                 ` Amir Goldstein
  2019-02-01 13:25                                                 ` Amir Goldstein
  0 siblings, 2 replies; 42+ messages in thread
From: Miklos Szeredi @ 2019-02-01  9:14 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

[-- Attachment #1: Type: text/plain, Size: 1575 bytes --]

On Thu, Jan 31, 2019 at 10:54 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Jan 31, 2019 at 6:24 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > There's some progress: it makes it through xfstests without crashing
> > or deadlocking.  There are still some regressing test cases, but
> > things definitely look better.
> >
>
> They do!
>
> kmemleak detected leaking ovl_readpage_work in ovl_readpage_work_fn()
> and creator_cred in ovl_direct_IO().
> I pushed your patch with the leak fixes to  ovl-aops-wip as a testing point.

Great, thanks for testing.

> > What remains is ctime/mtime support.
> >
> > Also fallocate/copyfile/reflink/etc will need some thought as we need
> > to not only flush out dirty pages, but in some cases prevent new
> > faults while the operation is in progress.
>
> Well, generic/503 which exercises these corners hangs
> (503.dmesg attached)

That was just a stupid bug in truncate (truncating ovl page with upper
inode lock held -> ABBA deadlock).

Attached the fix for that one.

> >
> > And after that it needs to be optimized (->readpages() and
> > ->writepages()), but I think that's the easy part.
> >
>
> And we also need to flush pages on non direct IO cases
> of ->writepages() and then we can revert:
> e8d4bfe3a715 ovl: Sync upper dirty data when syncing overlayfs
> and swoop the grand prize :)

Yeah.

I think the behavior should be tunable (ovl page cache vs. upper page
cache), at least initially, so it can be benchmarked in various
situations to see if there are no slowdowns with the new code.

Thanks,
Miklos

[-- Attachment #2: d.patch --]
[-- Type: text/x-patch, Size: 1195 bytes --]

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index aecd288f7fe2..f954f1425186 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -853,7 +853,7 @@ static int ovl_real_readpage(struct file *realfile, struct page *page)
 
 	iov_iter_bvec(&iter, READ, &bvec, 1, PAGE_SIZE);
 
-	ret = vfs_iter_read(realfile, &iter, &pos, 0);
+	ret = vfs_iter_read(realfile, &iter, &pos, RWF_DIRECT);
 	if (ret >= 0 && ret < PAGE_SIZE)
 		zero_user_segment(page, ret, PAGE_SIZE);
 
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index e139f92b64ba..dabcc92dd178 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -76,15 +76,15 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 	old_cred = ovl_override_creds(dentry->d_sb);
 	err = notify_change(upperdentry, attr, NULL);
 	revert_creds(old_cred);
-	if (!err) {
-		if (is_truncate)
-			truncate_setsize(inode, i_size_read(upperinode));
+	if (!err)
 		ovl_copyattr(upperinode, inode);
-	}
 	inode_unlock(upperinode);
 
-	if (is_truncate)
+	if (is_truncate) {
+		if (!err)
+			truncate_setsize(inode, i_size_read(upperinode));
 		put_write_access(upperinode);
+	}
 out_drop_write:
 	ovl_drop_write(dentry);
 out:

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-01  9:14                                               ` Miklos Szeredi
@ 2019-02-01 13:22                                                 ` Amir Goldstein
  2019-02-01 16:29                                                   ` Vivek Goyal
  2019-02-01 13:25                                                 ` Amir Goldstein
  1 sibling, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-02-01 13:22 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

> > > What remains is ctime/mtime support.
> > >
> > > Also fallocate/copyfile/reflink/etc will need some thought as we need
> > > to not only flush out dirty pages, but in some cases prevent new
> > > faults while the operation is in progress.
> >
> > Well, generic/503 which exercises these corners hangs
> > (503.dmesg attached)
>
> That was just a stupid bug in truncate (truncating ovl page with upper
> inode lock held -> ABBA deadlock).
>
> Attached the fix for that one.
>

Force pushed and re-tested.
Summary of regressions of "quick" xfstests since v5.0-rc1:

Wrong st_mtime/st_ctime:
generic/003 generic/080 generic/215

EIO on clone/dedupe:
generic/303 generic/304

Wrong st_blocks/fiemap:
generic/353 generic/392 generic/422

New lazy copy up semantics:
overlay/060

> > >
> > > And after that it needs to be optimized (->readpages() and
> > > ->writepages()), but I think that's the easy part.
> > >
> >
> > And we also need to flush pages on non direct IO cases
> > of ->writepages() and then we can revert:
> > e8d4bfe3a715 ovl: Sync upper dirty data when syncing overlayfs
> > and swoop the grand prize :)
>
> Yeah.
>
> I think the behavior should be tunable (ovl page cache vs. upper page
> cache), at least initially, so it can be benchmarked in various
> situations to see if there are no slowdowns with the new code.
>

Now the real fun begins... choosing the config option name ;-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-01  9:14                                               ` Miklos Szeredi
  2019-02-01 13:22                                                 ` Amir Goldstein
@ 2019-02-01 13:25                                                 ` Amir Goldstein
  2019-02-01 16:24                                                   ` Miklos Szeredi
  1 sibling, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-02-01 13:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

> That was just a stupid bug in truncate (truncating ovl page with upper
> inode lock held -> ABBA deadlock).
>
> Attached the fix for that one.
>

BTW, your patch includes another change, but:
RWF_DIRECT flag for vfs_read_iter() should depend on upper
direct_IO() support.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-01 13:25                                                 ` Amir Goldstein
@ 2019-02-01 16:24                                                   ` Miklos Szeredi
  2019-02-01 16:47                                                     ` Vivek Goyal
  2019-02-02 16:51                                                     ` Amir Goldstein
  0 siblings, 2 replies; 42+ messages in thread
From: Miklos Szeredi @ 2019-02-01 16:24 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

[-- Attachment #1: Type: text/plain, Size: 726 bytes --]

On Fri, Feb 1, 2019 at 2:25 PM Amir Goldstein <amir73il@gmail.com> wrote:

> Wrong st_mtime/st_ctime:
> generic/003 generic/080 generic/215

Fixed.  Albeit, I feel it's pretty hackish with various forms of
ovl_copyattr sprinkled all over the place.  Need to think about proper
management of attributes between upper and overlay inode.

>
> EIO on clone/dedupe:
> generic/303 generic/304
>
> Wrong st_blocks/fiemap:
> generic/353 generic/392 generic/422

Hmm, generic/353 passes for me...

>
> New lazy copy up semantics:
> overlay/060

Need to fix the testcase?

> BTW, your patch includes another change, but:
> RWF_DIRECT flag for vfs_read_iter() should depend on upper
> direct_IO() support.

Right, fixed.

Thanks,
Miklos

[-- Attachment #2: d.patch --]
[-- Type: text/x-patch, Size: 11297 bytes --]

diff --git a/fs/attr.c b/fs/attr.c
index d22e8187477f..2832024b5665 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -262,7 +262,8 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 
 	now = current_time(inode);
 
-	attr->ia_ctime = now;
+	if (!(ia_valid & ATTR_CTIME_SET))
+		attr->ia_ctime = now;
 	if (!(ia_valid & ATTR_ATIME_SET))
 		attr->ia_atime = now;
 	if (!(ia_valid & ATTR_MTIME_SET))
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 82c129bfe58d..e7a8ad995ae4 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -699,8 +699,11 @@ static int ovl_link(struct dentry *old, struct inode *newdir,
 			ovl_type_origin(old));
 	if (err)
 		iput(inode);
+	else
+		ovl_copy_ctime(ovl_inode_upper(inode), inode);
 
 	ovl_nlink_end(old);
+
 out_drop_write:
 	ovl_drop_write(old);
 out:
@@ -871,7 +874,7 @@ static int ovl_do_remove(struct dentry *dentry, bool is_dir)
 	 */
 	upperdentry = ovl_dentry_upper(dentry);
 	if (upperdentry)
-		ovl_copyattr(d_inode(upperdentry), d_inode(dentry));
+		ovl_copy_ctime(d_inode(upperdentry), d_inode(dentry));
 
 out_drop_write:
 	ovl_drop_write(dentry);
@@ -1213,9 +1216,9 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
 			 (d_inode(new) && ovl_type_origin(new)));
 
 	/* copy ctime: */
-	ovl_copyattr(d_inode(olddentry), d_inode(old));
+	ovl_copy_ctime(d_inode(olddentry), d_inode(old));
 	if (d_inode(new) && ovl_dentry_upper(new))
-		ovl_copyattr(d_inode(newdentry), d_inode(new));
+		ovl_copy_ctime(d_inode(newdentry), d_inode(new));
 
 out_dput:
 	dput(newdentry);
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index f954f1425186..9427fbb93cef 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -414,7 +414,7 @@ static ssize_t ovl_real_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	file_end_write(real.file);
 	revert_creds(old_cred);
 
-	/* Update size */
+	/* Update size/c/mtime */
 	ovl_copyattr_size(ovl_inode_real(inode), inode);
 
 	fdput(real);
@@ -597,7 +597,7 @@ static long ovl_fallocate(struct file *file, int mode, loff_t offset,
 	ret = vfs_fallocate(real.file, mode, offset, len);
 	revert_creds(old_cred);
 
-	/* Update size */
+	/* Update size/c/mtime */
 	ovl_copyattr_size(ovl_inode_real(inode), inode);
 
 	/* and invalidate mappings and page cache */
@@ -681,14 +681,21 @@ static long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			return ret;
 
 		ret = ovl_maybe_copy_up(file_dentry(file), OVL_COPY_UP_DATA);
-		if (!ret) {
-			ret = ovl_real_ioctl(file, cmd, arg);
+		if (ret)
+			goto drop_write;
 
-			inode_lock(inode);
-			ovl_copyflags(ovl_inode_real(inode), inode);
-			inode_unlock(inode);
-		}
+		ret = ovl_flush_filemap(file, 0, LLONG_MAX);
+		if (ret)
+			goto drop_write;
 
+		ret = ovl_real_ioctl(file, cmd, arg);
+
+		inode_lock(inode);
+		ovl_copyflags(ovl_inode_real(inode), inode);
+		if (!ret)
+			ovl_copy_ctime(ovl_inode_real(inode), inode);
+		inode_unlock(inode);
+drop_write:
 		mnt_drop_write_file(file);
 		break;
 
@@ -771,7 +778,7 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 	}
 	revert_creds(old_cred);
 
-	/* Update size */
+	/* Update size/c/mtime */
 	ovl_copyattr_size(ovl_inode_real(inode_out), inode_out);
 
 	if (op != OVL_DEDUPE)
@@ -850,10 +857,16 @@ static int ovl_real_readpage(struct file *realfile, struct page *page)
 	loff_t pos = page->index << PAGE_SHIFT;
 	struct iov_iter iter;
 	ssize_t ret;
+	rwf_t flags = 0;
+
+	if (realfile->f_mapping->a_ops &&
+	    realfile->f_mapping->a_ops->direct_IO) {
+		flags |= RWF_DIRECT;
+	}
 
 	iov_iter_bvec(&iter, READ, &bvec, 1, PAGE_SIZE);
 
-	ret = vfs_iter_read(realfile, &iter, &pos, RWF_DIRECT);
+	ret = vfs_iter_read(realfile, &iter, &pos, flags);
 	if (ret >= 0 && ret < PAGE_SIZE)
 		zero_user_segment(page, ret, PAGE_SIZE);
 
@@ -960,10 +973,11 @@ static int ovl_write_end(struct file *file, struct address_space *mapping,
 
 static int ovl_real_writepage(struct page *page, struct writeback_control *wbc)
 {
-	struct ovl_inode *oi = OVL_I(page->mapping->host);
+	struct inode *inode = page->mapping->host;
+	struct ovl_inode *oi = OVL_I(inode);
 	struct file *realfile = oi->upper_file;
 	loff_t pos = page->index << PAGE_SHIFT;
-	loff_t size = i_size_read(page->mapping->host);
+	loff_t size = i_size_read(inode);
 	size_t len = PAGE_SIZE;
 	struct bio_vec bvec = {
 		.bv_page = page,
@@ -978,7 +992,7 @@ static int ovl_real_writepage(struct page *page, struct writeback_control *wbc)
 		pr_info("ovl_writepage: size = %lli pos = %lli\n", size, pos);
 		return 0;
 	}
-	
+
 	if (size < pos + PAGE_SIZE) {
 		/* Can't do direct I/O for tail pages */
 		len = size - pos;
@@ -987,7 +1001,7 @@ static int ovl_real_writepage(struct page *page, struct writeback_control *wbc)
 		flags |= RWF_DIRECT;
 	}
 
-	//pr_info("ovl_real_writepage(%lli)\n", pos);
+	pr_info("ovl_real_writepage(%p/%lli)...\n", inode, pos);
 	iov_iter_bvec(&iter, WRITE, &bvec, 1, len);
 
 	ret = vfs_iter_write(realfile, &iter, &pos, flags);
@@ -997,7 +1011,7 @@ static int ovl_real_writepage(struct page *page, struct writeback_control *wbc)
 
 	/* FADV_DONTNEED for tail pages*/
 
-	//pr_info("ovl_real_writepage(%lli) = %li\n", pos, ret);
+	pr_info("ovl_real_writepage(%p/%lli) = %li\n", inode, pos, ret);
 
 	return ret < 0 ? ret : 0;
 }
@@ -1021,6 +1035,9 @@ static int ovl_writepage(struct page *page, struct writeback_control *wbc)
 	 */
 	end_page_writeback(page);
 
+	/* Redirty ovl inode due to just having modified upper c/mtime */
+	__mark_inode_dirty(page->mapping->host, I_DIRTY_TIME | I_DIRTY_SYNC);
+
 	return ret;
 }
 
@@ -1045,7 +1062,7 @@ static ssize_t ovl_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		ret = vfs_iter_write(real.file, iter, &iocb->ki_pos, flags);
 		file_end_write(real.file);
 
-		/* Update size */
+		/* Update size/c/mtime */
 		ovl_copyattr_size(ovl_inode_real(inode), inode);
 	}
 	revert_creds(old_cred);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index dabcc92dd178..0d959c26c414 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -77,7 +77,7 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 	err = notify_change(upperdentry, attr, NULL);
 	revert_creds(old_cred);
 	if (!err)
-		ovl_copyattr(upperinode, inode);
+		ovl_copyattr_time(upperinode, inode);
 	inode_unlock(upperinode);
 
 	if (is_truncate) {
@@ -181,7 +181,11 @@ int ovl_getattr(const struct path *path, struct kstat *stat,
 	 * real inode i_size. Perhaps we should generic_fillattr() and only
 	 * update individual stats from real inode?
 	 */
-	stat->size = i_size_read(inode);
+	if (d_inode(realpath.dentry) == ovl_inode_upper(inode)) {
+		stat->size = i_size_read(inode);
+		stat->ctime = inode->i_ctime;
+		stat->mtime = inode->i_mtime;
+	}
 
 	/*
 	 * For non-dir or same fs, we use st_ino of the copy up origin.
@@ -378,7 +382,7 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
 	revert_creds(old_cred);
 
 	/* copy c/mtime */
-	ovl_copyattr(d_inode(realdentry), inode);
+	ovl_copyattr_time(d_inode(realdentry), inode);
 
 out_drop_write:
 	ovl_drop_write(dentry);
@@ -473,7 +477,10 @@ int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags)
 			touch_atime(&upperpath);
 			inode->i_atime = d_inode(upperpath.dentry)->i_atime;
 		}
+		return 0;
 	}
+	pr_info("ovl_update_time(%p/%lli.%09lu/%i)\n",
+		inode, ts->tv_sec, ts->tv_nsec, flags);
 	return generic_update_time(inode, ts, flags);
 }
 
@@ -585,7 +592,8 @@ static void ovl_fill_inode(struct inode *inode, umode_t mode, dev_t rdev,
 		inode->i_ino = get_next_ino();
 	}
 	inode->i_mode = mode;
-	inode->i_flags |= S_NOCMTIME;
+	if (!S_ISREG(mode))
+		inode->i_flags |= S_NOCMTIME;
 #ifdef CONFIG_FS_POSIX_ACL
 	inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
 #endif
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 576488162257..e52385700aa3 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -377,16 +377,26 @@ static inline void ovl_copyattr(struct inode *from, struct inode *to)
 	to->i_gid = from->i_gid;
 	to->i_mode = from->i_mode;
 	to->i_atime = from->i_atime;
+}
+
+static inline void ovl_copyattr_time(struct inode *from, struct inode *to)
+{
+	ovl_copyattr(from, to);
 	to->i_mtime = from->i_mtime;
 	to->i_ctime = from->i_ctime;
 }
 
 static inline void ovl_copyattr_size(struct inode *from, struct inode *to)
 {
-	ovl_copyattr(from, to);
+	ovl_copyattr_time(from, to);
 	i_size_write(to, i_size_read(from));
 }
 
+static inline void ovl_copy_ctime(struct inode *from, struct inode *to)
+{
+	to->i_ctime = from->i_ctime;
+}
+
 static inline void ovl_copyflags(struct inode *from, struct inode *to)
 {
 	unsigned int mask = S_SYNC | S_IMMUTABLE | S_APPEND | S_NOATIME;
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index cfbf1d8994c0..fe01fcc667c2 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -220,6 +220,32 @@ static void ovl_destroy_inode(struct inode *inode)
 	call_rcu(&inode->i_rcu, ovl_i_callback);
 }
 
+static int ovl_write_inode(struct inode *inode, struct writeback_control *wbc)
+{
+	int err;
+	const struct cred *old_cred;
+	struct dentry *upperdentry = ovl_i_dentry_upper(inode);
+	struct inode *upperinode = d_inode(upperdentry);
+	struct iattr attr = {
+		.ia_valid = ATTR_CTIME | ATTR_CTIME_SET |
+			    ATTR_MTIME | ATTR_MTIME_SET,
+		.ia_ctime = inode->i_ctime,
+		.ia_mtime = inode->i_mtime,
+	};
+
+	pr_info("ovl_write_inode(%p)...\n", inode);
+
+	inode_lock(upperinode);
+	old_cred = ovl_override_creds(inode->i_sb);
+	err = notify_change(upperdentry, &attr, NULL);
+	revert_creds(old_cred);
+	inode_unlock(upperinode);
+
+	pr_info("ovl_write_inode(%p): %i\n", inode, err);
+
+	return err;
+}
+
 static void ovl_free_fs(struct ovl_fs *ofs)
 {
 	unsigned i;
@@ -385,6 +411,7 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.destroy_inode	= ovl_destroy_inode,
+	.write_inode	= ovl_write_inode,
 	.put_super	= ovl_put_super,
 	.sync_fs	= ovl_sync_fs,
 	.statfs		= ovl_statfs,
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index b1989c479947..2aa9efdc869e 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -456,7 +456,7 @@ static void ovl_dentry_version_inc(struct dentry *dentry, bool impurity)
 void ovl_dir_modified(struct dentry *dentry, bool impurity)
 {
 	/* Copy mtime/ctime */
-	ovl_copyattr(d_inode(ovl_dentry_upper(dentry)), d_inode(dentry));
+	ovl_copyattr_time(d_inode(ovl_dentry_upper(dentry)), d_inode(dentry));
 
 	ovl_dentry_version_inc(dentry, impurity);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2901ffc43c11..902af18f52ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -183,7 +183,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define ATTR_CTIME	(1 << 6)
 #define ATTR_ATIME_SET	(1 << 7)
 #define ATTR_MTIME_SET	(1 << 8)
-#define ATTR_FORCE	(1 << 9) /* Not a change, but a change it */
+#define ATTR_CTIME_SET  (1 << 9) /* kernel internal, for now... */
+#define ATTR_FORCE	(1 << 10) /* Not a change, but a change it */
 #define ATTR_KILL_SUID	(1 << 11)
 #define ATTR_KILL_SGID	(1 << 12)
 #define ATTR_FILE	(1 << 13)

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-01 13:22                                                 ` Amir Goldstein
@ 2019-02-01 16:29                                                   ` Vivek Goyal
  0 siblings, 0 replies; 42+ messages in thread
From: Vivek Goyal @ 2019-02-01 16:29 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, cgxu519, overlayfs

On Fri, Feb 01, 2019 at 03:22:34PM +0200, Amir Goldstein wrote:

[..]
> New lazy copy up semantics:
> overlay/060

Right now we open file with O_APPEND to make sure data is copied up. We
could probably change that to write some data to file and then make sure
file is copied up? Anyway, we are not promising that file will definitely
get copied up after opening with O_RDWR or O_APPEND.

And could add some tests for lazy copy up to make sure opening file
with O_RDWR alone does not do data copy up. (If lazy copy up is
enabled).

Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-01 16:24                                                   ` Miklos Szeredi
@ 2019-02-01 16:47                                                     ` Vivek Goyal
  2019-02-02 16:51                                                     ` Amir Goldstein
  1 sibling, 0 replies; 42+ messages in thread
From: Vivek Goyal @ 2019-02-01 16:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Amir Goldstein, cgxu519, overlayfs

On Fri, Feb 01, 2019 at 05:24:26PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 1, 2019 at 2:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > Wrong st_mtime/st_ctime:
> > generic/003 generic/080 generic/215
> 
> Fixed.  Albeit, I feel it's pretty hackish with various forms of
> ovl_copyattr sprinkled all over the place.  Need to think about proper
> management of attributes between upper and overlay inode.
> 
> >
> > EIO on clone/dedupe:
> > generic/303 generic/304
> >
> > Wrong st_blocks/fiemap:
> > generic/353 generic/392 generic/422
> 
> Hmm, generic/353 passes for me...
> 
> >
> > New lazy copy up semantics:
> > overlay/060
> 
> Need to fix the testcase?

This seems to fix xfstest overlay/060 for me.

Vivek

Subject: Fix metacopy test cases for data copy up on first read/write

Overlayfs might copy up data of file on first read/write of file (and
not necessarily upon open of file). So read few bytes from file opened
with O_RDWR and after that data must have been copied up.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 tests/overlay/060 |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: xfstests-dev/tests/overlay/060
===================================================================
--- xfstests-dev.orig/tests/overlay/060	2018-08-27 11:21:51.811847671 -0400
+++ xfstests-dev/tests/overlay/060	2019-02-01 11:43:43.354762812 -0500
@@ -170,7 +170,7 @@ test_common()
 
 	# Trigger data copy up and check absence of metacopy xattr.
 	mount_overlay $_lowerdir
-	$XFS_IO_PROG -c "open -a $SCRATCH_MNT/$_target"
+	$XFS_IO_PROG -c "pread 0 1" $SCRATCH_MNT/$_target >> $seqres.full
 	echo "check properties of data copied up file"
 	check_file_size_contents $SCRATCH_MNT/$_target $_size "$_data"
 	umount_overlay

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-01 16:24                                                   ` Miklos Szeredi
  2019-02-01 16:47                                                     ` Vivek Goyal
@ 2019-02-02 16:51                                                     ` Amir Goldstein
  2019-02-05  7:42                                                       ` Amir Goldstein
  1 sibling, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-02-02 16:51 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

On Fri, Feb 1, 2019 at 6:24 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, Feb 1, 2019 at 2:25 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > Wrong st_mtime/st_ctime:
> > generic/003 generic/080 generic/215
>
> Fixed.  Albeit, I feel it's pretty hackish with various forms of
> ovl_copyattr sprinkled all over the place.  Need to think about proper
> management of attributes between upper and overlay inode.
>

Pushed this one to the test branch as well.

> >
> > EIO on clone/dedupe:
> > generic/303 generic/304
> >
> > Wrong st_blocks/fiemap:
> > generic/353 generic/392 generic/422
>
> Hmm, generic/353 passes for me...
>

So for me:
running generic/353 after generic/303 it fails
and running it again it passes.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-02 16:51                                                     ` Amir Goldstein
@ 2019-02-05  7:42                                                       ` Amir Goldstein
  2019-02-05  7:50                                                         ` Miklos Szeredi
  2019-02-05 13:03                                                         ` Amir Goldstein
  0 siblings, 2 replies; 42+ messages in thread
From: Amir Goldstein @ 2019-02-05  7:42 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

> > >
> > > EIO on clone/dedupe:
> > > generic/303 generic/304
> > >
> > > Wrong st_blocks/fiemap:
> > > generic/353 generic/392 generic/422
> >
> > Hmm, generic/353 passes for me...
> >
>
> So for me:
> running generic/353 after generic/303 it fails
> and running it again it passes.
>

The failed line in 303,304 is:

xfs_io  -c "pread -v -q 9223372036854775806 1" /mnt/test/foo
pread: Input/output error

And failure seems not related to clone/dedupe at all.
Easily reproduced on any file (i.e. touch /mnt/test/foo),
so its probably just an arithmetic len/offset bug.

I may have more time to look into this later today.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-05  7:42                                                       ` Amir Goldstein
@ 2019-02-05  7:50                                                         ` Miklos Szeredi
  2019-02-05 13:03                                                         ` Amir Goldstein
  1 sibling, 0 replies; 42+ messages in thread
From: Miklos Szeredi @ 2019-02-05  7:50 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Tue, Feb 5, 2019 at 8:42 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > >
> > > > EIO on clone/dedupe:
> > > > generic/303 generic/304
> > > >
> > > > Wrong st_blocks/fiemap:
> > > > generic/353 generic/392 generic/422
> > >
> > > Hmm, generic/353 passes for me...
> > >
> >
> > So for me:
> > running generic/353 after generic/303 it fails
> > and running it again it passes.
> >
>
> The failed line in 303,304 is:
>
> xfs_io  -c "pread -v -q 9223372036854775806 1" /mnt/test/foo
> pread: Input/output error
>
> And failure seems not related to clone/dedupe at all.
> Easily reproduced on any file (i.e. touch /mnt/test/foo),
> so its probably just an arithmetic len/offset bug.
>
> I may have more time to look into this later today.

Okay.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-05  7:42                                                       ` Amir Goldstein
  2019-02-05  7:50                                                         ` Miklos Szeredi
@ 2019-02-05 13:03                                                         ` Amir Goldstein
  2019-02-06 15:36                                                           ` Miklos Szeredi
  1 sibling, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-02-05 13:03 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

On Tue, Feb 5, 2019 at 9:42 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > >
> > > > EIO on clone/dedupe:
> > > > generic/303 generic/304
> > > >
> > > > Wrong st_blocks/fiemap:
> > > > generic/353 generic/392 generic/422
> > >
> > > Hmm, generic/353 passes for me...
> > >
> >
> > So for me:
> > running generic/353 after generic/303 it fails
> > and running it again it passes.
> >
>
> The failed line in 303,304 is:
>
> xfs_io  -c "pread -v -q 9223372036854775806 1" /mnt/test/foo
> pread: Input/output error
>
> And failure seems not related to clone/dedupe at all.
> Easily reproduced on any file (i.e. touch /mnt/test/foo),
> so its probably just an arithmetic len/offset bug.
>

Yes, pos+PAGE_SIZE < 0 (overrun).

Pushed a fix to truncate upper read to i_size.
It's more general although maybe non needed
for the common case, where pos+PAGE_SIZE > 0.

Interesting side effect. Now generic/353 fails always (or more often).
It appears as though a 64K reflinked file is broken to 2 extents
while the test expects it to have a single extent.

I am not sure why the test makes this expectation, maybe because
it was very probable for buffered write and made it easier to write the test.
The fact is that if we issue buffered write to upper file, the test passes
as expected, but when using direct IO to upper file the test fails.

I guess that direct IO results in different block reservation than
buffered IO in xfs. So it appears that converting buffered IO to direct IO
on upper file may have some implications.

Its quite likely that this specific issue will go away with ->writepages(),
so let's wait and see if that is an issue.
The question is whether we want to mess with hidden filesystem
implementation details or we would want to provide a knob for the
user to choose between direct and buffered IO to upper.

Double page caches can be mitigated by invalidating upper page
cache after writeback. In any case, we need to invalidate upper
page cache after copy up and in other occasions (like when upper
fs has no direct IO support), so the code for using buffered IO to
upper and cleaning upper page cache needs to be implemented anyway.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-05 13:03                                                         ` Amir Goldstein
@ 2019-02-06 15:36                                                           ` Miklos Szeredi
  2019-02-12  7:43                                                             ` Amir Goldstein
  0 siblings, 1 reply; 42+ messages in thread
From: Miklos Szeredi @ 2019-02-06 15:36 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Tue, Feb 5, 2019 at 2:03 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Feb 5, 2019 at 9:42 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > >
> > > > > EIO on clone/dedupe:
> > > > > generic/303 generic/304
> > > > >
> > > > > Wrong st_blocks/fiemap:
> > > > > generic/353 generic/392 generic/422
> > > >
> > > > Hmm, generic/353 passes for me...
> > > >
> > >
> > > So for me:
> > > running generic/353 after generic/303 it fails
> > > and running it again it passes.
> > >
> >
> > The failed line in 303,304 is:
> >
> > xfs_io  -c "pread -v -q 9223372036854775806 1" /mnt/test/foo
> > pread: Input/output error
> >
> > And failure seems not related to clone/dedupe at all.
> > Easily reproduced on any file (i.e. touch /mnt/test/foo),
> > so its probably just an arithmetic len/offset bug.
> >
>
> Yes, pos+PAGE_SIZE < 0 (overrun).
>
> Pushed a fix to truncate upper read to i_size.
> It's more general although maybe non needed
> for the common case, where pos+PAGE_SIZE > 0.

Yes, we can allow tail page DIO reads, with the exception of the page
at LLONG_MAX.

> Interesting side effect. Now generic/353 fails always (or more often).
> It appears as though a 64K reflinked file is broken to 2 extents
> while the test expects it to have a single extent.
>
> I am not sure why the test makes this expectation, maybe because
> it was very probable for buffered write and made it easier to write the test.
> The fact is that if we issue buffered write to upper file, the test passes
> as expected, but when using direct IO to upper file the test fails.
>
> I guess that direct IO results in different block reservation than
> buffered IO in xfs. So it appears that converting buffered IO to direct IO
> on upper file may have some implications.
>
> Its quite likely that this specific issue will go away with ->writepages(),
> so let's wait and see if that is an issue.
> The question is whether we want to mess with hidden filesystem
> implementation details or we would want to provide a knob for the
> user to choose between direct and buffered IO to upper.

I don't think anyone outside that testcase would care.  I'd just add a
knob (and possibly fix the testcase...)

> Double page caches can be mitigated by invalidating upper page
> cache after writeback.  In any case, we need to invalidate upper
> page cache after copy up and in other occasions (like when upper
> fs has no direct IO support), so the code for using buffered IO to
> upper and cleaning upper page cache needs to be implemented anyway.

I think the only case where upper doesn't have direct IO support is
tmpfs.  And in that case we should just turn off overlay pagecache by
default, since there's no upside to having a cache at the overlay
level.  Same probably holds for DAX mounted filesystems as well.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-06 15:36                                                           ` Miklos Szeredi
@ 2019-02-12  7:43                                                             ` Amir Goldstein
  2019-02-12  8:11                                                               ` Miklos Szeredi
  0 siblings, 1 reply; 42+ messages in thread
From: Amir Goldstein @ 2019-02-12  7:43 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Vivek Goyal, cgxu519, overlayfs

> I think the only case where upper doesn't have direct IO support is
> tmpfs.  And in that case we should just turn off overlay pagecache by
> default, since there's no upside to having a cache at the overlay
> level.  Same probably holds for DAX mounted filesystems as well.
>

Confused. Isn't overlay page cache the mechanism by which we
plan to resolve MMAP_SHARED inconsistency.
Do you propose that tmpfs/DAX won't be able to get MMAP_SHARED
consistency, or am I missing something?

This thread has been rolling for a while, so here is a recap of remaining
issues:

- Fix st_blocks failing tests
- Synchronize page faults with truncate/fallocate?
  :Also fallocate/copyfile/reflink/etc will need some thought as we need
  :to not only flush out dirty pages, but in some cases prevent new
  :faults while the operation is in progress.
- Optimization/performance
  :And after that it needs to be optimized (->readpages() and
  :->writepages()), but I think that's the easy part.
- Truncate upper inode pages (post copyup and tail page write)
- Make page cache opt-in and require direct_IO()
- ovl_sync_fs()

Did I miss anything?

I will try to carve out some time to work on this during next cycle
and/or chew at the TODO list occasionally, unless you get to it
before I do. Let me know if you intend to work on this.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH v2 0/5] Experiments with overlayfs filemap
  2019-02-12  7:43                                                             ` Amir Goldstein
@ 2019-02-12  8:11                                                               ` Miklos Szeredi
  0 siblings, 0 replies; 42+ messages in thread
From: Miklos Szeredi @ 2019-02-12  8:11 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Vivek Goyal, cgxu519, overlayfs

On Tue, Feb 12, 2019 at 8:44 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I think the only case where upper doesn't have direct IO support is
> > tmpfs.  And in that case we should just turn off overlay pagecache by
> > default, since there's no upside to having a cache at the overlay
> > level.  Same probably holds for DAX mounted filesystems as well.
> >
>
> Confused. Isn't overlay page cache the mechanism by which we
> plan to resolve MMAP_SHARED inconsistency.
> Do you propose that tmpfs/DAX won't be able to get MMAP_SHARED
> consistency, or am I missing something?

What I propose is making the default behavior be the most optimal (no
overlay pagecache for tmpfs or dax), but allow forcing overlay
pagecache on or off.

As you've said, the most important goodie to come from the overlay
pagecache is correct selection of inodes on syncfs.  MAP_SHARED
consistency is more of a tickbox thing, IMO.

>
> This thread has been rolling for a while, so here is a recap of remaining
> issues:
>
> - Fix st_blocks failing tests
> - Synchronize page faults with truncate/fallocate?
>   :Also fallocate/copyfile/reflink/etc will need some thought as we need
>   :to not only flush out dirty pages, but in some cases prevent new
>   :faults while the operation is in progress.
> - Optimization/performance
>   :And after that it needs to be optimized (->readpages() and
>   :->writepages()), but I think that's the easy part.
> - Truncate upper inode pages (post copyup and tail page write)
> - Make page cache opt-in and require direct_IO()
> - ovl_sync_fs()
>
> Did I miss anything?
>
> I will try to carve out some time to work on this during next cycle
> and/or chew at the TODO list occasionally, unless you get to it
> before I do. Let me know if you intend to work on this.

Okay.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2019-02-12  8:11 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-22 12:34 [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
2019-01-22 12:34 ` [RFC][PATCH v2 1/5] ovl: reorder tests in ovl_open_need_copy_up() Amir Goldstein
2019-01-22 12:34 ` [RFC][PATCH v2 2/5] ovl: prepare for generic filemap file operations Amir Goldstein
2019-01-22 12:34 ` [RFC][PATCH v2 3/5] ovl: lazy copy up of data on first data access Amir Goldstein
2019-01-22 12:34 ` [RFC][PATCH v2 4/5] ovl: lazy copy up data on page fault Amir Goldstein
2019-01-22 12:34 ` [RFC][PATCH v2 5/5] ovl: noop aops to test filemap operations and lazy copy up Amir Goldstein
2019-01-24 17:18 ` [RFC][PATCH v2 0/5] Experiments with overlayfs filemap Amir Goldstein
2019-01-24 22:35   ` Amir Goldstein
2019-01-25  9:54     ` Miklos Szeredi
2019-01-25 11:24       ` Amir Goldstein
2019-01-25 12:21         ` Miklos Szeredi
2019-01-25 13:04           ` Amir Goldstein
2019-01-25 13:31             ` Miklos Szeredi
2019-01-25 15:56               ` Miklos Szeredi
2019-01-25 21:18                 ` Amir Goldstein
2019-01-27 18:22                   ` Amir Goldstein
2019-01-28 19:22                     ` Vivek Goyal
2019-01-28 20:57                       ` Amir Goldstein
2019-01-28 21:17                         ` Miklos Szeredi
2019-01-28 21:22                           ` Miklos Szeredi
2019-01-28 22:14                             ` Amir Goldstein
2019-01-29  7:17                               ` Miklos Szeredi
2019-01-29  8:54                                 ` Amir Goldstein
2019-01-29  8:58                                   ` Miklos Szeredi
2019-01-29  9:12                                     ` Amir Goldstein
2019-01-29 12:44                                       ` Miklos Szeredi
2019-01-29 16:47                                         ` Amir Goldstein
2019-01-31 16:23                                           ` Miklos Szeredi
2019-01-31 21:54                                             ` Amir Goldstein
2019-02-01  9:14                                               ` Miklos Szeredi
2019-02-01 13:22                                                 ` Amir Goldstein
2019-02-01 16:29                                                   ` Vivek Goyal
2019-02-01 13:25                                                 ` Amir Goldstein
2019-02-01 16:24                                                   ` Miklos Szeredi
2019-02-01 16:47                                                     ` Vivek Goyal
2019-02-02 16:51                                                     ` Amir Goldstein
2019-02-05  7:42                                                       ` Amir Goldstein
2019-02-05  7:50                                                         ` Miklos Szeredi
2019-02-05 13:03                                                         ` Amir Goldstein
2019-02-06 15:36                                                           ` Miklos Szeredi
2019-02-12  7:43                                                             ` Amir Goldstein
2019-02-12  8:11                                                               ` Miklos Szeredi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.