Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs
@ 2020-10-25  3:41 Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 1/8] ovl: setup overlayfs' private bdi Chengguang Xu
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Current syncfs(2) syscall on overlayfs just calls sync_filesystem()
on upper_sb to synchronize whole dirty inodes in upper filesystem
regardless of the overlay ownership of the inode. In the use case of
container, when multiple containers using the same underlying upper
filesystem, it has some shortcomings as below.

(1) Performance
Synchronization is probably heavy because it actually syncs unnecessary
inodes for target overlayfs.

(2) Interference
Unplanned synchronization will probably impact IO performance of
unrelated container processes on the other overlayfs.

This series try to implement containerized syncfs for overlayfs so that
only sync target dirty upper inodes which are belong to specific overlayfs
instance. By doing this, it is able to reduce cost of synchronization and
will not seriously impact IO performance of unrelated processes. 

v1->v2:
- Mark overlayfs' inode dirty itself instead of adding notification
  mechanism to vfs inode.

Chengguang Xu (8):
  ovl: setup overlayfs' private bdi
  ovl: implement ->writepages operation
  ovl: implement overlayfs' ->evict_inode operation
  ovl: mark overlayfs' inode dirty on modification
  ovl: mark overlayfs' inode dirty on shared writable mmap
  ovl: implement overlayfs' ->write_inode operation
  ovl: cache dirty overlayfs' inode
  ovl: implement containerized syncfs for overlayfs

 fs/overlayfs/file.c      |  4 +++
 fs/overlayfs/inode.c     | 27 +++++++++++++++++++
 fs/overlayfs/overlayfs.h |  4 +++
 fs/overlayfs/super.c     | 57 +++++++++++++++++++++++++++++++++++++---
 fs/overlayfs/util.c      | 14 ++++++++++
 5 files changed, 102 insertions(+), 4 deletions(-)

-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 1/8] ovl: setup overlayfs' private bdi
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
@ 2020-10-25  3:41 ` Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 2/8] ovl: implement ->writepages operation Chengguang Xu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Setup overlayfs' private bdi so that we can collect
overlayfs' own dirty inodes.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/super.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..d1e546abce87 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1869,6 +1869,10 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	if (!ofs)
 		goto out;
 
+	err = super_setup_bdi(sb);
+	if (err)
+		goto out_err;
+
 	ofs->creator_cred = cred = prepare_creds();
 	if (!cred)
 		goto out_err;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 2/8] ovl: implement ->writepages operation
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 1/8] ovl: setup overlayfs' private bdi Chengguang Xu
@ 2020-10-25  3:41 ` Chengguang Xu
  2020-11-02 17:17   ` Jan Kara
  2020-10-25  3:41 ` [RFC PATCH v2 3/8] ovl: implement overlayfs' ->evict_inode operation Chengguang Xu
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Implement overlayfs' ->writepages operation so that
we can sync dirty data/metadata to upper filesystem.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/inode.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index b584dca845ba..f27fc5be34df 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -11,6 +11,7 @@
 #include <linux/posix_acl.h>
 #include <linux/ratelimit.h>
 #include <linux/fiemap.h>
+#include <linux/writeback.h>
 #include "overlayfs.h"
 
 
@@ -516,7 +517,32 @@ static const struct inode_operations ovl_special_inode_operations = {
 	.update_time	= ovl_update_time,
 };
 
+static int ovl_writepages(struct address_space *mapping,
+			  struct writeback_control *wbc)
+{
+	struct inode *upper_inode = ovl_inode_upper(mapping->host);
+	struct ovl_fs *ofs =  mapping->host->i_sb->s_fs_info;
+	struct writeback_control tmp_wbc = *wbc;
+
+	if (!ovl_should_sync(ofs))
+		return 0;
+
+	/*
+	 * for sync(2) writeback, it has a separate external IO
+	 * completion path by checking PAGECACHE_TAG_WRITEBACK
+	 * in pagecache, we have to set for_sync to 0 in thie case,
+	 * let writeback waits completion after syncing individual
+	 * dirty inode, because we haven't implemented overlayfs'
+	 * own pagecache yet.
+	 */
+	if (wbc->for_sync && (wbc->sync_mode == WB_SYNC_ALL))
+		tmp_wbc.for_sync = 0;
+
+	return sync_inode(upper_inode, &tmp_wbc);
+}
+
 static const struct address_space_operations ovl_aops = {
+	.writepages		= ovl_writepages,
 	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
 	.direct_IO		= noop_direct_IO,
 };
-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 3/8] ovl: implement overlayfs' ->evict_inode operation
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 1/8] ovl: setup overlayfs' private bdi Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 2/8] ovl: implement ->writepages operation Chengguang Xu
@ 2020-10-25  3:41 ` Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 4/8] ovl: mark overlayfs' inode dirty on modification Chengguang Xu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Implement overlayfs' ->evict_inode operation,
so that we can clear all inode dirty flags.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/super.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index d1e546abce87..5d7e12b8bd1f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -390,11 +390,18 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 	return ret;
 }
 
+static void ovl_evict_inode(struct inode *inode)
+{
+	inode->i_state &= ~I_DIRTY_ALL;
+	clear_inode(inode);
+}
+
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.free_inode	= ovl_free_inode,
 	.destroy_inode	= ovl_destroy_inode,
 	.drop_inode	= generic_delete_inode,
+	.evict_inode	= ovl_evict_inode,
 	.put_super	= ovl_put_super,
 	.sync_fs	= ovl_sync_fs,
 	.statfs		= ovl_statfs,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 4/8] ovl: mark overlayfs' inode dirty on modification
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
                   ` (2 preceding siblings ...)
  2020-10-25  3:41 ` [RFC PATCH v2 3/8] ovl: implement overlayfs' ->evict_inode operation Chengguang Xu
@ 2020-10-25  3:41 ` Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap Chengguang Xu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Mark overlayfs' inode dirty on modification so that
we can recognize target inodes during syncfs.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/inode.c     |  1 +
 fs/overlayfs/overlayfs.h |  4 ++++
 fs/overlayfs/util.c      | 14 ++++++++++++++
 3 files changed, 19 insertions(+)

diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index f27fc5be34df..426f70ca5d5e 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -468,6 +468,7 @@ int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags)
 		if (upperpath.dentry) {
 			touch_atime(&upperpath);
 			inode->i_atime = d_inode(upperpath.dentry)->i_atime;
+			ovl_mark_inode_dirty(inode);
 		}
 	}
 	return 0;
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..eaf1d5b05d8e 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -247,6 +247,7 @@ static inline bool ovl_open_flags_need_copy_up(int flags)
 }
 
 /* util.c */
+void ovl_mark_inode_dirty(struct inode *inode);
 int ovl_want_write(struct dentry *dentry);
 void ovl_drop_write(struct dentry *dentry);
 struct dentry *ovl_workdir(struct dentry *dentry);
@@ -472,6 +473,9 @@ static inline void ovl_copyattr(struct inode *from, struct inode *to)
 	to->i_mtime = from->i_mtime;
 	to->i_ctime = from->i_ctime;
 	i_size_write(to, i_size_read(from));
+
+	if (ovl_inode_upper(to) && from->i_state & I_DIRTY_ALL)
+		ovl_mark_inode_dirty(to);
 }
 
 static inline void ovl_copyflags(struct inode *from, struct inode *to)
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..a6f59df744ae 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -950,3 +950,17 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 	kfree(buf);
 	return ERR_PTR(res);
 }
+
+/*
+ * We intentionally add I_DIRTY_SYNC flag regardless dirty flag
+ * of upper inode so that we have chance to invoke ->write_inode
+ * to re-dirty overlayfs' inode during writeback process.
+ */
+void ovl_mark_inode_dirty(struct inode *inode)
+{
+	struct inode *upper = ovl_inode_upper(inode);
+	unsigned long iflag = I_DIRTY_SYNC;
+
+	iflag |= upper->i_state & I_DIRTY_ALL;
+	__mark_inode_dirty(inode, iflag);
+}
-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
                   ` (3 preceding siblings ...)
  2020-10-25  3:41 ` [RFC PATCH v2 4/8] ovl: mark overlayfs' inode dirty on modification Chengguang Xu
@ 2020-10-25  3:41 ` Chengguang Xu
  2020-11-02 17:30   ` Jan Kara
  2020-10-25  3:41 ` [RFC PATCH v2 6/8] ovl: implement overlayfs' ->write_inode operation Chengguang Xu
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Overlayfs cannot be notified when mmapped area gets dirty,
so we need to proactively mark inode dirty in ->mmap operation.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..cd6fcdfd81a9 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 		/* Drop reference count from new vm_file value */
 		fput(realfile);
 	} else {
+		if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
+		    vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
+			ovl_mark_inode_dirty(file_inode(file));
+
 		/* Drop reference count from previous vm_file value */
 		fput(file);
 	}
-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 6/8] ovl: implement overlayfs' ->write_inode operation
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
                   ` (4 preceding siblings ...)
  2020-10-25  3:41 ` [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap Chengguang Xu
@ 2020-10-25  3:41 ` Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 7/8] ovl: cache dirty overlayfs' inode Chengguang Xu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Implement overlayfs' ->write_inode to sync dirty data
and redirty overlayfs' inode if necessary.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/super.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 5d7e12b8bd1f..960e79e7a8b5 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -15,6 +15,7 @@
 #include <linux/seq_file.h>
 #include <linux/posix_acl_xattr.h>
 #include <linux/exportfs.h>
+#include <linux/writeback.h>
 #include "overlayfs.h"
 
 MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
@@ -396,11 +397,36 @@ static void ovl_evict_inode (struct inode *inode)
 	clear_inode(inode);
 }
 
+static int ovl_write_inode(struct inode *inode,
+			   struct writeback_control *wbc)
+{
+	struct inode *upper = ovl_inode_upper(inode);
+	unsigned long iflag = 0;
+	int ret = 0;
+
+	if (!upper)
+		return 0;
+
+	if (upper->i_sb->s_op->write_inode)
+		ret = upper->i_sb->s_op->write_inode(inode, wbc);
+
+	iflag |= upper->i_state & I_DIRTY_ALL;
+	if (mapping_writably_mapped(upper->i_mapping) ||
+	    mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
+		iflag |= I_DIRTY_PAGES;
+
+	if (iflag)
+		ovl_mark_inode_dirty(inode);
+
+	return ret;
+}
+
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.free_inode	= ovl_free_inode,
 	.destroy_inode	= ovl_destroy_inode,
 	.drop_inode	= generic_delete_inode,
+	.write_inode	= ovl_write_inode,
 	.evict_inode	= ovl_evict_inode,
 	.put_super	= ovl_put_super,
 	.sync_fs	= ovl_sync_fs,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 7/8] ovl: cache dirty overlayfs' inode
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
                   ` (5 preceding siblings ...)
  2020-10-25  3:41 ` [RFC PATCH v2 6/8] ovl: implement overlayfs' ->write_inode operation Chengguang Xu
@ 2020-10-25  3:41 ` Chengguang Xu
  2020-10-25  3:41 ` [RFC PATCH v2 8/8] ovl: implement containerized syncfs for overlayfs Chengguang Xu
  2020-10-30 15:46 ` [RFC PATCH v2 0/8] " Miklos Szeredi
  8 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Now drop overlayfs' inode will sync dirty data,
so we change to only drop clean inode.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/super.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 960e79e7a8b5..1d04117fb6ad 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -421,11 +421,21 @@ static int ovl_write_inode(struct inode *inode,
 	return ret;
 }
 
+int ovl_drop_inode(struct inode *inode)
+{
+	struct inode *upper = ovl_inode_upper(inode);
+
+	if (!upper || !(inode->i_state & I_DIRTY_ALL))
+		return 1;
+
+	return generic_drop_inode(inode);
+}
+
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.free_inode	= ovl_free_inode,
 	.destroy_inode	= ovl_destroy_inode,
-	.drop_inode	= generic_delete_inode,
+	.drop_inode	= ovl_drop_inode,
 	.write_inode	= ovl_write_inode,
 	.evict_inode	= ovl_evict_inode,
 	.put_super	= ovl_put_super,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 8/8] ovl: implement containerized syncfs for overlayfs
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
                   ` (6 preceding siblings ...)
  2020-10-25  3:41 ` [RFC PATCH v2 7/8] ovl: cache dirty overlayfs' inode Chengguang Xu
@ 2020-10-25  3:41 ` Chengguang Xu
  2020-10-30 15:46 ` [RFC PATCH v2 0/8] " Miklos Szeredi
  8 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-10-25  3:41 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, linux-fsdevel, Chengguang Xu

Now overlayfs can only sync dirty inode during
syncfs, so remove unnecessary sync_filesystem()
on upper file system.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/super.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 1d04117fb6ad..df33e8c8f1d0 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -271,8 +271,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 	 * All the super blocks will be iterated, including upper_sb.
 	 *
-	 * If this is a syncfs(2) call, then we do need to call
-	 * sync_filesystem() on upper_sb, but enough if we do it when being
+	 * If this is a syncfs(2) call, it will be enough we do it when being
 	 * called with wait == 1.
 	 */
 	if (!wait)
@@ -281,7 +280,10 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
 
 	down_read(&upper_sb->s_umount);
-	ret = sync_filesystem(upper_sb);
+	if (upper_sb->s_op->sync_fs)
+		ret = upper_sb->s_op->sync_fs(upper_sb, wait);
+	if (!ret)
+		ret = sync_blockdev(upper_sb->s_bdev);
 	up_read(&upper_sb->s_umount);
 
 	return ret;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs
  2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
                   ` (7 preceding siblings ...)
  2020-10-25  3:41 ` [RFC PATCH v2 8/8] ovl: implement containerized syncfs for overlayfs Chengguang Xu
@ 2020-10-30 15:46 ` Miklos Szeredi
  2020-10-31 12:22   ` Chengguang Xu
  8 siblings, 1 reply; 23+ messages in thread
From: Miklos Szeredi @ 2020-10-30 15:46 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Amir Goldstein, Jan Kara, overlayfs, linux-fsdevel

On Sun, Oct 25, 2020 at 4:42 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>
> Current syncfs(2) syscall on overlayfs just calls sync_filesystem()
> on upper_sb to synchronize whole dirty inodes in upper filesystem
> regardless of the overlay ownership of the inode. In the use case of
> container, when multiple containers using the same underlying upper
> filesystem, it has some shortcomings as below.
>
> (1) Performance
> Synchronization is probably heavy because it actually syncs unnecessary
> inodes for target overlayfs.
>
> (2) Interference
> Unplanned synchronization will probably impact IO performance of
> unrelated container processes on the other overlayfs.
>
> This series try to implement containerized syncfs for overlayfs so that
> only sync target dirty upper inodes which are belong to specific overlayfs
> instance. By doing this, it is able to reduce cost of synchronization and
> will not seriously impact IO performance of unrelated processes.

Series looks good at first glance.  Still need to do an in-depth review.

In the meantime can you post some numbers showing the performance improvements?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs
  2020-10-30 15:46 ` [RFC PATCH v2 0/8] " Miklos Szeredi
@ 2020-10-31 12:22   ` Chengguang Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-10-31 12:22 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Amir Goldstein, Jan Kara, overlayfs, linux-fsdevel

 ---- 在 星期五, 2020-10-30 23:46:00 Miklos Szeredi <miklos@szeredi.hu> 撰写 ----
 > On Sun, Oct 25, 2020 at 4:42 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
 > >
 > > Current syncfs(2) syscall on overlayfs just calls sync_filesystem()
 > > on upper_sb to synchronize whole dirty inodes in upper filesystem
 > > regardless of the overlay ownership of the inode. In the use case of
 > > container, when multiple containers using the same underlying upper
 > > filesystem, it has some shortcomings as below.
 > >
 > > (1) Performance
 > > Synchronization is probably heavy because it actually syncs unnecessary
 > > inodes for target overlayfs.
 > >
 > > (2) Interference
 > > Unplanned synchronization will probably impact IO performance of
 > > unrelated container processes on the other overlayfs.
 > >
 > > This series try to implement containerized syncfs for overlayfs so that
 > > only sync target dirty upper inodes which are belong to specific overlayfs
 > > instance. By doing this, it is able to reduce cost of synchronization and
 > > will not seriously impact IO performance of unrelated processes.
 > 
 > Series looks good at first glance.  Still need to do an in-depth review.
 > 
 > In the meantime can you post some numbers showing the performance improvements?
 > 

Actually, the performance improvement depends on workload and the number of overlay instances.

I did a very simple test on my dev box (intel NUC ) and the result seems quite obvious.

I made two overlay instances and untar a kernel source to an overlay then compare

umount time(sys part) of two instances. 

------------------------------------------
test env:
Intel(R) Core(TM) i7-8809G CPU @ 3.10GHz  8core
32GB memory
Samsung NVMe SSD 970 pro

test script:

[root@localhost ovl]# cat do.sh
#!/bin/sh

rm -rf upper/* &> /dev/null
echo 3 > /proc/sys/vm/drop_caches

mount -t overlay overlay -o lowerdir=lower,workdir=work,upperdir=upper merged
mount -t overlay overlay -o lowerdir=lower2,workdir=work2,upperdir=upper2 merged2

tar Jxvf linux-5.9.tar.xz -C merged &> /dev/null

time umount merged2
time umount merged

---------------------------------
test result:

[root@localhost ovl]# seq 3 | while read line ; do sh -x do.sh ; done
+ rm -rf upper/linux-5.9
+ echo 3
+ mount -t overlay overlay -o lowerdir=lower,workdir=work,upperdir=upper merged
+ mount -t overlay overlay -o lowerdir=lower2,workdir=work2,upperdir=upper2 merged2
+ tar Jxvf linux-5.9.tar.xz -C merged
+ umount merged2

real    0m0.437s
user    0m0.001s
sys     0m0.001s
+ umount merged

real    0m2.107s
user    0m0.002s
sys     0m0.090s
+ rm -rf upper/linux-5.9
+ echo 3
+ mount -t overlay overlay -o lowerdir=lower,workdir=work,upperdir=upper merged
+ mount -t overlay overlay -o lowerdir=lower2,workdir=work2,upperdir=upper2 merged2
+ tar Jxvf linux-5.9.tar.xz -C merged
+ umount merged2

real    0m0.443s
user    0m0.002s
sys     0m0.001s
+ umount merged

real    0m2.032s
user    0m0.001s
sys     0m0.105s
+ rm -rf upper/linux-5.9
+ echo 3
+ mount -t overlay overlay -o lowerdir=lower,workdir=work,upperdir=upper merged
+ mount -t overlay overlay -o lowerdir=lower2,workdir=work2,upperdir=upper2 merged2
+ tar Jxvf linux-5.9.tar.xz -C merged
+ umount merged2

real    0m0.263s
user    0m0.001s
sys     0m0.001s
+ umount merged

real    0m2.094s
user    0m0.000s
sys     0m0.083s
[root@localhost o




Thanks,
Chengguang


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/8] ovl: implement ->writepages operation
  2020-10-25  3:41 ` [RFC PATCH v2 2/8] ovl: implement ->writepages operation Chengguang Xu
@ 2020-11-02 17:17   ` Jan Kara
  2020-11-04 12:18     ` Chengguang Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-11-02 17:17 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: miklos, amir73il, jack, linux-unionfs, linux-fsdevel

On Sun 25-10-20 11:41:11, Chengguang Xu wrote:
> Implement overlayfs' ->writepages operation so that
> we can sync dirty data/metadata to upper filesystem.
> 
> Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> ---
>  fs/overlayfs/inode.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index b584dca845ba..f27fc5be34df 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -11,6 +11,7 @@
>  #include <linux/posix_acl.h>
>  #include <linux/ratelimit.h>
>  #include <linux/fiemap.h>
> +#include <linux/writeback.h>
>  #include "overlayfs.h"
>  
>  
> @@ -516,7 +517,32 @@ static const struct inode_operations ovl_special_inode_operations = {
>  	.update_time	= ovl_update_time,
>  };
>  
> +static int ovl_writepages(struct address_space *mapping,
> +			  struct writeback_control *wbc)
> +{
> +	struct inode *upper_inode = ovl_inode_upper(mapping->host);
> +	struct ovl_fs *ofs =  mapping->host->i_sb->s_fs_info;
> +	struct writeback_control tmp_wbc = *wbc;
> +
> +	if (!ovl_should_sync(ofs))
> +		return 0;
> +
> +	/*
> +	 * for sync(2) writeback, it has a separate external IO
> +	 * completion path by checking PAGECACHE_TAG_WRITEBACK
> +	 * in pagecache, we have to set for_sync to 0 in thie case,
> +	 * let writeback waits completion after syncing individual
> +	 * dirty inode, because we haven't implemented overlayfs'
> +	 * own pagecache yet.
> +	 */
> +	if (wbc->for_sync && (wbc->sync_mode == WB_SYNC_ALL))
> +		tmp_wbc.for_sync = 0;

This looks really hacky as it closely depends on the internal details of
writeback implementation. I'd be more open to say export wait_sb_inodes()
for overlayfs use... Because that's what I gather you need in your
overlayfs ->syncfs() implementation.

								Honza

> +
> +	return sync_inode(upper_inode, &tmp_wbc);
> +}
> +
>  static const struct address_space_operations ovl_aops = {
> +	.writepages		= ovl_writepages,
>  	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
>  	.direct_IO		= noop_direct_IO,
>  };
> -- 
> 2.26.2
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-10-25  3:41 ` [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap Chengguang Xu
@ 2020-11-02 17:30   ` Jan Kara
  2020-11-04 11:54     ` Chengguang Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-11-02 17:30 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: miklos, amir73il, jack, linux-unionfs, linux-fsdevel

On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
> Overlayfs cannot be notified when mmapped area gets dirty,
> so we need to proactively mark inode dirty in ->mmap operation.
> 
> Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> ---
>  fs/overlayfs/file.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index efccb7c1f9bc..cd6fcdfd81a9 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
>  		/* Drop reference count from new vm_file value */
>  		fput(realfile);
>  	} else {
> +		if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
> +		    vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
> +			ovl_mark_inode_dirty(file_inode(file));
> +

But does this work reliably? I mean once writeback runs, your inode (as
well as upper inode) is cleaned. Then a page fault comes so file has dirty
pages again and would need flushing but overlayfs inode stays clean? Am I
missing something?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-11-02 17:30   ` Jan Kara
@ 2020-11-04 11:54     ` Chengguang Xu
  2020-11-05 14:03       ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Chengguang Xu @ 2020-11-04 11:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: miklos, amir73il, linux-unionfs, linux-fsdevel, cgxu519, charliecgxu

 ---- 在 星期二, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> 撰写 ----
 > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
 > > Overlayfs cannot be notified when mmapped area gets dirty,
 > > so we need to proactively mark inode dirty in ->mmap operation.
 > > 
 > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > > ---
 > >  fs/overlayfs/file.c | 4 ++++
 > >  1 file changed, 4 insertions(+)
 > > 
 > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
 > > index efccb7c1f9bc..cd6fcdfd81a9 100644
 > > --- a/fs/overlayfs/file.c
 > > +++ b/fs/overlayfs/file.c
 > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 > >          /* Drop reference count from new vm_file value */
 > >          fput(realfile);
 > >      } else {
 > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
 > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
 > > +            ovl_mark_inode_dirty(file_inode(file));
 > > +
 > 
 > But does this work reliably? I mean once writeback runs, your inode (as
 > well as upper inode) is cleaned. Then a page fault comes so file has dirty
 > pages again and would need flushing but overlayfs inode stays clean? Am I
 > missing something?
 > 

Yeah, this is key point of this approach, in order to  fix the issue I explicitly set 
I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i mean is during writeback
we will call into ->write_inode() by this flag(I_DIRTY_SYNC) and at that place
we get chance to check mapping and re-dirty overlay's inode. The code logic
like below in ovl_write_inode().

    if (mapping_writably_mapped(upper->i_mapping) ||
         mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
                 iflag |= I_DIRTY_PAGES; 




Thanks,
Chengguang

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/8] ovl: implement ->writepages operation
  2020-11-02 17:17   ` Jan Kara
@ 2020-11-04 12:18     ` Chengguang Xu
  2020-11-05 13:55       ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Chengguang Xu @ 2020-11-04 12:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: miklos, amir73il, linux-unionfs, linux-fsdevel, cgxu519, charliecgxu

 ---- 在 星期二, 2020-11-03 01:17:41 Jan Kara <jack@suse.cz> 撰写 ----
 > On Sun 25-10-20 11:41:11, Chengguang Xu wrote:
 > > Implement overlayfs' ->writepages operation so that
 > > we can sync dirty data/metadata to upper filesystem.
 > > 
 > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > > ---
 > >  fs/overlayfs/inode.c | 26 ++++++++++++++++++++++++++
 > >  1 file changed, 26 insertions(+)
 > > 
 > > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
 > > index b584dca845ba..f27fc5be34df 100644
 > > --- a/fs/overlayfs/inode.c
 > > +++ b/fs/overlayfs/inode.c
 > > @@ -11,6 +11,7 @@
 > >  #include <linux/posix_acl.h>
 > >  #include <linux/ratelimit.h>
 > >  #include <linux/fiemap.h>
 > > +#include <linux/writeback.h>
 > >  #include "overlayfs.h"
 > >  
 > >  
 > > @@ -516,7 +517,32 @@ static const struct inode_operations ovl_special_inode_operations = {
 > >      .update_time    = ovl_update_time,
 > >  };
 > >  
 > > +static int ovl_writepages(struct address_space *mapping,
 > > +              struct writeback_control *wbc)
 > > +{
 > > +    struct inode *upper_inode = ovl_inode_upper(mapping->host);
 > > +    struct ovl_fs *ofs =  mapping->host->i_sb->s_fs_info;
 > > +    struct writeback_control tmp_wbc = *wbc;
 > > +
 > > +    if (!ovl_should_sync(ofs))
 > > +        return 0;
 > > +
 > > +    /*
 > > +     * for sync(2) writeback, it has a separate external IO
 > > +     * completion path by checking PAGECACHE_TAG_WRITEBACK
 > > +     * in pagecache, we have to set for_sync to 0 in thie case,
 > > +     * let writeback waits completion after syncing individual
 > > +     * dirty inode, because we haven't implemented overlayfs'
 > > +     * own pagecache yet.
 > > +     */
 > > +    if (wbc->for_sync && (wbc->sync_mode == WB_SYNC_ALL))
 > > +        tmp_wbc.for_sync = 0;
 > 
 > This looks really hacky as it closely depends on the internal details of
 > writeback implementation. I'd be more open to say export wait_sb_inodes()
 > for overlayfs use... Because that's what I gather you need in your
 > overlayfs ->syncfs() implementation.
 > 

Does  that mean we gather synced overlay's inode into a new waiting list(overlay's) and
do the waiting behavior in overlay's ->syncfs() ?


Thanks,
Chengguang



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/8] ovl: implement ->writepages operation
  2020-11-04 12:18     ` Chengguang Xu
@ 2020-11-05 13:55       ` Jan Kara
  2020-11-06  5:57         ` Chengguang Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-11-05 13:55 UTC (permalink / raw)
  To: Chengguang Xu
  Cc: Jan Kara, miklos, amir73il, linux-unionfs, linux-fsdevel, charliecgxu

On Wed 04-11-20 20:18:16, Chengguang Xu wrote:
>  ---- 在 星期二, 2020-11-03 01:17:41 Jan Kara <jack@suse.cz> 撰写 ----
>  > On Sun 25-10-20 11:41:11, Chengguang Xu wrote:
>  > > Implement overlayfs' ->writepages operation so that
>  > > we can sync dirty data/metadata to upper filesystem.
>  > > 
>  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
>  > > ---
>  > >  fs/overlayfs/inode.c | 26 ++++++++++++++++++++++++++
>  > >  1 file changed, 26 insertions(+)
>  > > 
>  > > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
>  > > index b584dca845ba..f27fc5be34df 100644
>  > > --- a/fs/overlayfs/inode.c
>  > > +++ b/fs/overlayfs/inode.c
>  > > @@ -11,6 +11,7 @@
>  > >  #include <linux/posix_acl.h>
>  > >  #include <linux/ratelimit.h>
>  > >  #include <linux/fiemap.h>
>  > > +#include <linux/writeback.h>
>  > >  #include "overlayfs.h"
>  > >  
>  > >  
>  > > @@ -516,7 +517,32 @@ static const struct inode_operations ovl_special_inode_operations = {
>  > >      .update_time    = ovl_update_time,
>  > >  };
>  > >  
>  > > +static int ovl_writepages(struct address_space *mapping,
>  > > +              struct writeback_control *wbc)
>  > > +{
>  > > +    struct inode *upper_inode = ovl_inode_upper(mapping->host);
>  > > +    struct ovl_fs *ofs =  mapping->host->i_sb->s_fs_info;
>  > > +    struct writeback_control tmp_wbc = *wbc;
>  > > +
>  > > +    if (!ovl_should_sync(ofs))
>  > > +        return 0;
>  > > +
>  > > +    /*
>  > > +     * for sync(2) writeback, it has a separate external IO
>  > > +     * completion path by checking PAGECACHE_TAG_WRITEBACK
>  > > +     * in pagecache, we have to set for_sync to 0 in thie case,
>  > > +     * let writeback waits completion after syncing individual
>  > > +     * dirty inode, because we haven't implemented overlayfs'
>  > > +     * own pagecache yet.
>  > > +     */
>  > > +    if (wbc->for_sync && (wbc->sync_mode == WB_SYNC_ALL))
>  > > +        tmp_wbc.for_sync = 0;
>  > 
>  > This looks really hacky as it closely depends on the internal details of
>  > writeback implementation. I'd be more open to say export wait_sb_inodes()
>  > for overlayfs use... Because that's what I gather you need in your
>  > overlayfs ->syncfs() implementation.
>  > 
> 
> Does  that mean we gather synced overlay's inode into a new waiting list(overlay's) and
> do the waiting behavior in overlay's ->syncfs() ?

My idea was that you'd just use the standard writeback logic which ends up
gathering upper_sb inodes in the upper_sb->s_inodes_wb and then wait for
them in overlay's ->syncfs(). Maybe we'll end up waiting for more inodes
than strictly necessary but it shouldn't be too bad I'd say...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-11-04 11:54     ` Chengguang Xu
@ 2020-11-05 14:03       ` Jan Kara
  2020-11-05 14:21         ` Amir Goldstein
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-11-05 14:03 UTC (permalink / raw)
  To: Chengguang Xu
  Cc: Jan Kara, miklos, amir73il, linux-unionfs, linux-fsdevel, charliecgxu

On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
>  ---- 在 星期二, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> 撰写 ----
>  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
>  > > Overlayfs cannot be notified when mmapped area gets dirty,
>  > > so we need to proactively mark inode dirty in ->mmap operation.
>  > > 
>  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
>  > > ---
>  > >  fs/overlayfs/file.c | 4 ++++
>  > >  1 file changed, 4 insertions(+)
>  > > 
>  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
>  > > --- a/fs/overlayfs/file.c
>  > > +++ b/fs/overlayfs/file.c
>  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
>  > >          /* Drop reference count from new vm_file value */
>  > >          fput(realfile);
>  > >      } else {
>  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
>  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
>  > > +            ovl_mark_inode_dirty(file_inode(file));
>  > > +
>  > 
>  > But does this work reliably? I mean once writeback runs, your inode (as
>  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
>  > pages again and would need flushing but overlayfs inode stays clean? Am I
>  > missing something?
>  > 
> 
> Yeah, this is key point of this approach, in order to  fix the issue I
> explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
> mean is during writeback we will call into ->write_inode() by this
> flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
> re-dirty overlay's inode. The code logic like below in ovl_write_inode().
> 
>     if (mapping_writably_mapped(upper->i_mapping) ||
>          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
>                  iflag |= I_DIRTY_PAGES; 

OK, but suppose the upper mapping is clean at this moment (upper inode has
been fully written out for whatever reason, but it is still mapped) so your
overlayfs inode becomes clean as well. Then I don't see a mechanism which
would make your overlayfs inode dirty again when a write to mmap happens,
set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.

Note that ovl_mmap() gets called only at mmap(2) syscall time but then
pages get faulted in, dirtied, cleaned fully at discretion of the mm
/ writeback subsystem.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-11-05 14:03       ` Jan Kara
@ 2020-11-05 14:21         ` Amir Goldstein
  2020-11-05 15:54           ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Amir Goldstein @ 2020-11-05 14:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: Chengguang Xu, miklos, linux-unionfs, linux-fsdevel, charliecgxu

On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
> >  ---- 在 星期二, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> 撰写 ----
> >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
> >  > > Overlayfs cannot be notified when mmapped area gets dirty,
> >  > > so we need to proactively mark inode dirty in ->mmap operation.
> >  > >
> >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> >  > > ---
> >  > >  fs/overlayfs/file.c | 4 ++++
> >  > >  1 file changed, 4 insertions(+)
> >  > >
> >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
> >  > > --- a/fs/overlayfs/file.c
> >  > > +++ b/fs/overlayfs/file.c
> >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
> >  > >          /* Drop reference count from new vm_file value */
> >  > >          fput(realfile);
> >  > >      } else {
> >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
> >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
> >  > > +            ovl_mark_inode_dirty(file_inode(file));
> >  > > +
> >  >
> >  > But does this work reliably? I mean once writeback runs, your inode (as
> >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
> >  > pages again and would need flushing but overlayfs inode stays clean? Am I
> >  > missing something?
> >  >
> >
> > Yeah, this is key point of this approach, in order to  fix the issue I
> > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
> > mean is during writeback we will call into ->write_inode() by this
> > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
> > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
> >
> >     if (mapping_writably_mapped(upper->i_mapping) ||
> >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
> >                  iflag |= I_DIRTY_PAGES;
>
> OK, but suppose the upper mapping is clean at this moment (upper inode has
> been fully written out for whatever reason, but it is still mapped) so your
> overlayfs inode becomes clean as well. Then I don't see a mechanism which
> would make your overlayfs inode dirty again when a write to mmap happens,
> set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
>
> Note that ovl_mmap() gets called only at mmap(2) syscall time but then
> pages get faulted in, dirtied, cleaned fully at discretion of the mm
> / writeback subsystem.
>

Perhaps I will add some background.

What I suggested was to maintain a "suspect list" in addition to
the dirty ovl inodes.

ovl inode is added to the suspect list on mmap (writable) and removed
from the suspect list on release() flush() or on sync_fs() if real inode is no
longer writably mapped.

There was another variant where ovl inode is added to suspect list on open
for write and removed from suspect list on release() flush() or sync_fs()
if real inode is not inode_is_open_for_write().

In both cases the list will have inodes whose real is not dirty, but
in both cases
the list shouldn't be terribly large to traverse on sync_fs().

Chengguang tried to implement the idea without an actual list by
re-dirtying the "suspect" inodes on every write_inode(), but I personally have
no idea if his idea works.

I think we can resort to using an actual suspect list if you say that it
cannot work like this?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-11-05 14:21         ` Amir Goldstein
@ 2020-11-05 15:54           ` Jan Kara
  2020-11-06  2:41             ` Chengguang Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-11-05 15:54 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Chengguang Xu, miklos, linux-unionfs, linux-fsdevel,
	charliecgxu

On Thu 05-11-20 16:21:27, Amir Goldstein wrote:
> On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
> > >  ---- 在 星期二, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> 撰写 ----
> > >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
> > >  > > Overlayfs cannot be notified when mmapped area gets dirty,
> > >  > > so we need to proactively mark inode dirty in ->mmap operation.
> > >  > >
> > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> > >  > > ---
> > >  > >  fs/overlayfs/file.c | 4 ++++
> > >  > >  1 file changed, 4 insertions(+)
> > >  > >
> > >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
> > >  > > --- a/fs/overlayfs/file.c
> > >  > > +++ b/fs/overlayfs/file.c
> > >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
> > >  > >          /* Drop reference count from new vm_file value */
> > >  > >          fput(realfile);
> > >  > >      } else {
> > >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
> > >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
> > >  > > +            ovl_mark_inode_dirty(file_inode(file));
> > >  > > +
> > >  >
> > >  > But does this work reliably? I mean once writeback runs, your inode (as
> > >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
> > >  > pages again and would need flushing but overlayfs inode stays clean? Am I
> > >  > missing something?
> > >  >
> > >
> > > Yeah, this is key point of this approach, in order to  fix the issue I
> > > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
> > > mean is during writeback we will call into ->write_inode() by this
> > > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
> > > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
> > >
> > >     if (mapping_writably_mapped(upper->i_mapping) ||
> > >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
> > >                  iflag |= I_DIRTY_PAGES;
> >
> > OK, but suppose the upper mapping is clean at this moment (upper inode has
> > been fully written out for whatever reason, but it is still mapped) so your
> > overlayfs inode becomes clean as well. Then I don't see a mechanism which
> > would make your overlayfs inode dirty again when a write to mmap happens,
> > set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
> >
> > Note that ovl_mmap() gets called only at mmap(2) syscall time but then
> > pages get faulted in, dirtied, cleaned fully at discretion of the mm
> > / writeback subsystem.
> >
> 
> Perhaps I will add some background.
> 
> What I suggested was to maintain a "suspect list" in addition to
> the dirty ovl inodes.
> 
> ovl inode is added to the suspect list on mmap (writable) and removed
> from the suspect list on release() flush() or on sync_fs() if real inode is no
> longer writably mapped.
> 
> There was another variant where ovl inode is added to suspect list on open
> for write and removed from suspect list on release() flush() or sync_fs()
> if real inode is not inode_is_open_for_write().
> 
> In both cases the list will have inodes whose real is not dirty, but
> in both cases
> the list shouldn't be terribly large to traverse on sync_fs().
> 
> Chengguang tried to implement the idea without an actual list by
> re-dirtying the "suspect" inodes on every write_inode(), but I personally have
> no idea if his idea works.
> 
> I think we can resort to using an actual suspect list if you say that it
> cannot work like this?

Yeah, the suspect list (i.e., additional list of inodes to check on sync)
you describe should work fine. Also the "keep suspect inode dirty" idea
of Chengguang could work fine but we'd have to use something like
inode_is_open_for_write() or inode_is_writeably_mapped() (which would need
to be implemented but it should be easy vma_interval_tree_foreach() walk
checking each found VMA for vma->vm_flags & VM_WRITE) for checking whether
inode should be redirtied or not.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-11-05 15:54           ` Jan Kara
@ 2020-11-06  2:41             ` Chengguang Xu
  2020-11-06  8:50               ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Chengguang Xu @ 2020-11-06  2:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, miklos, linux-unionfs, linux-fsdevel, charliecgxu

 ---- 在 星期四, 2020-11-05 23:54:34 Jan Kara <jack@suse.cz> 撰写 ----
 > On Thu 05-11-20 16:21:27, Amir Goldstein wrote:
 > > On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
 > > >
 > > > On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
 > > > >  ---- 在 星期二, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> 撰写 ----
 > > > >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
 > > > >  > > Overlayfs cannot be notified when mmapped area gets dirty,
 > > > >  > > so we need to proactively mark inode dirty in ->mmap operation.
 > > > >  > >
 > > > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > > > >  > > ---
 > > > >  > >  fs/overlayfs/file.c | 4 ++++
 > > > >  > >  1 file changed, 4 insertions(+)
 > > > >  > >
 > > > >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
 > > > >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
 > > > >  > > --- a/fs/overlayfs/file.c
 > > > >  > > +++ b/fs/overlayfs/file.c
 > > > >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 > > > >  > >          /* Drop reference count from new vm_file value */
 > > > >  > >          fput(realfile);
 > > > >  > >      } else {
 > > > >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
 > > > >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
 > > > >  > > +            ovl_mark_inode_dirty(file_inode(file));
 > > > >  > > +
 > > > >  >
 > > > >  > But does this work reliably? I mean once writeback runs, your inode (as
 > > > >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
 > > > >  > pages again and would need flushing but overlayfs inode stays clean? Am I
 > > > >  > missing something?
 > > > >  >
 > > > >
 > > > > Yeah, this is key point of this approach, in order to  fix the issue I
 > > > > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
 > > > > mean is during writeback we will call into ->write_inode() by this
 > > > > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
 > > > > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
 > > > >
 > > > >     if (mapping_writably_mapped(upper->i_mapping) ||
 > > > >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
 > > > >                  iflag |= I_DIRTY_PAGES;
 > > >
 > > > OK, but suppose the upper mapping is clean at this moment (upper inode has
 > > > been fully written out for whatever reason, but it is still mapped) so your
 > > > overlayfs inode becomes clean as well. Then I don't see a mechanism which
 > > > would make your overlayfs inode dirty again when a write to mmap happens,
 > > > set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
 > > >
 > > > Note that ovl_mmap() gets called only at mmap(2) syscall time but then
 > > > pages get faulted in, dirtied, cleaned fully at discretion of the mm
 > > > / writeback subsystem.
 > > >
 > > 
 > > Perhaps I will add some background.
 > > 
 > > What I suggested was to maintain a "suspect list" in addition to
 > > the dirty ovl inodes.
 > > 
 > > ovl inode is added to the suspect list on mmap (writable) and removed
 > > from the suspect list on release() flush() or on sync_fs() if real inode is no
 > > longer writably mapped.
 > > 
 > > There was another variant where ovl inode is added to suspect list on open
 > > for write and removed from suspect list on release() flush() or sync_fs()
 > > if real inode is not inode_is_open_for_write().
 > > 
 > > In both cases the list will have inodes whose real is not dirty, but
 > > in both cases
 > > the list shouldn't be terribly large to traverse on sync_fs().
 > > 
 > > Chengguang tried to implement the idea without an actual list by
 > > re-dirtying the "suspect" inodes on every write_inode(), but I personally have
 > > no idea if his idea works.
 > > 
 > > I think we can resort to using an actual suspect list if you say that it
 > > cannot work like this?
 > 
 > Yeah, the suspect list (i.e., additional list of inodes to check on sync)
 > you describe should work fine. 

I think this solution still has the problem we have met in below thread[1]
The main problem is the state combination of clean overlayfs' inode && dirty upper inode.
 
[1] https://www.spinics.net/lists/linux-unionfs/msg07448.html

 > Also the "keep suspect inode dirty" idea
 > of Chengguang could work fine but we'd have to use something like
 > inode_is_open_for_write() or inode_is_writeably_mapped() (which would need
 > to be implemented but it should be easy vma_interval_tree_foreach() walk
 > checking each found VMA for vma->vm_flags & VM_WRITE) for checking whether
 > inode should be redirtied or not.
 > 

I'm curious that isn't  it enough to check  i_mmap_writable by mapping_writably_mapped() ?
Am I missing something?


Thanks,
Chengguang

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 2/8] ovl: implement ->writepages operation
  2020-11-05 13:55       ` Jan Kara
@ 2020-11-06  5:57         ` Chengguang Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-11-06  5:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: miklos, amir73il, linux-unionfs, linux-fsdevel, charliecgxu

 ---- 在 星期四, 2020-11-05 21:55:06 Jan Kara <jack@suse.cz> 撰写 ----
 > On Wed 04-11-20 20:18:16, Chengguang Xu wrote:
 > >  ---- 在 星期二, 2020-11-03 01:17:41 Jan Kara <jack@suse.cz> 撰写 ----
 > >  > On Sun 25-10-20 11:41:11, Chengguang Xu wrote:
 > >  > > Implement overlayfs' ->writepages operation so that
 > >  > > we can sync dirty data/metadata to upper filesystem.
 > >  > > 
 > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > >  > > ---
 > >  > >  fs/overlayfs/inode.c | 26 ++++++++++++++++++++++++++
 > >  > >  1 file changed, 26 insertions(+)
 > >  > > 
 > >  > > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
 > >  > > index b584dca845ba..f27fc5be34df 100644
 > >  > > --- a/fs/overlayfs/inode.c
 > >  > > +++ b/fs/overlayfs/inode.c
 > >  > > @@ -11,6 +11,7 @@
 > >  > >  #include <linux/posix_acl.h>
 > >  > >  #include <linux/ratelimit.h>
 > >  > >  #include <linux/fiemap.h>
 > >  > > +#include <linux/writeback.h>
 > >  > >  #include "overlayfs.h"
 > >  > >  
 > >  > >  
 > >  > > @@ -516,7 +517,32 @@ static const struct inode_operations ovl_special_inode_operations = {
 > >  > >      .update_time    = ovl_update_time,
 > >  > >  };
 > >  > >  
 > >  > > +static int ovl_writepages(struct address_space *mapping,
 > >  > > +              struct writeback_control *wbc)
 > >  > > +{
 > >  > > +    struct inode *upper_inode = ovl_inode_upper(mapping->host);
 > >  > > +    struct ovl_fs *ofs =  mapping->host->i_sb->s_fs_info;
 > >  > > +    struct writeback_control tmp_wbc = *wbc;
 > >  > > +
 > >  > > +    if (!ovl_should_sync(ofs))
 > >  > > +        return 0;
 > >  > > +
 > >  > > +    /*
 > >  > > +     * for sync(2) writeback, it has a separate external IO
 > >  > > +     * completion path by checking PAGECACHE_TAG_WRITEBACK
 > >  > > +     * in pagecache, we have to set for_sync to 0 in thie case,
 > >  > > +     * let writeback waits completion after syncing individual
 > >  > > +     * dirty inode, because we haven't implemented overlayfs'
 > >  > > +     * own pagecache yet.
 > >  > > +     */
 > >  > > +    if (wbc->for_sync && (wbc->sync_mode == WB_SYNC_ALL))
 > >  > > +        tmp_wbc.for_sync = 0;
 > >  > 
 > >  > This looks really hacky as it closely depends on the internal details of
 > >  > writeback implementation. I'd be more open to say export wait_sb_inodes()
 > >  > for overlayfs use... Because that's what I gather you need in your
 > >  > overlayfs ->syncfs() implementation.
 > >  > 
 > > 
 > > Does  that mean we gather synced overlay's inode into a new waiting list(overlay's) and
 > > do the waiting behavior in overlay's ->syncfs() ?
 > 
 > My idea was that you'd just use the standard writeback logic which ends up
 > gathering upper_sb inodes in the upper_sb->s_inodes_wb and then wait for
 > them in overlay's ->syncfs(). Maybe we'll end up waiting for more inodes
 > than strictly necessary but it shouldn't be too bad I'd say...
 > 

Yeah, I agree with you, I'll modify in next version.


Thanks,
Chengguang

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-11-06  2:41             ` Chengguang Xu
@ 2020-11-06  8:50               ` Jan Kara
  2020-11-06  9:47                 ` Chengguang Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2020-11-06  8:50 UTC (permalink / raw)
  To: Chengguang Xu
  Cc: Jan Kara, Amir Goldstein, miklos, linux-unionfs, linux-fsdevel,
	charliecgxu

On Fri 06-11-20 10:41:44, Chengguang Xu wrote:
>  ---- 在 星期四, 2020-11-05 23:54:34 Jan Kara <jack@suse.cz> 撰写 ----
>  > On Thu 05-11-20 16:21:27, Amir Goldstein wrote:
>  > > On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
>  > > >
>  > > > On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
>  > > > >  ---- 在 星期二, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> 撰写 ----
>  > > > >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
>  > > > >  > > Overlayfs cannot be notified when mmapped area gets dirty,
>  > > > >  > > so we need to proactively mark inode dirty in ->mmap operation.
>  > > > >  > >
>  > > > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
>  > > > >  > > ---
>  > > > >  > >  fs/overlayfs/file.c | 4 ++++
>  > > > >  > >  1 file changed, 4 insertions(+)
>  > > > >  > >
>  > > > >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>  > > > >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
>  > > > >  > > --- a/fs/overlayfs/file.c
>  > > > >  > > +++ b/fs/overlayfs/file.c
>  > > > >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
>  > > > >  > >          /* Drop reference count from new vm_file value */
>  > > > >  > >          fput(realfile);
>  > > > >  > >      } else {
>  > > > >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
>  > > > >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
>  > > > >  > > +            ovl_mark_inode_dirty(file_inode(file));
>  > > > >  > > +
>  > > > >  >
>  > > > >  > But does this work reliably? I mean once writeback runs, your inode (as
>  > > > >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
>  > > > >  > pages again and would need flushing but overlayfs inode stays clean? Am I
>  > > > >  > missing something?
>  > > > >  >
>  > > > >
>  > > > > Yeah, this is key point of this approach, in order to  fix the issue I
>  > > > > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
>  > > > > mean is during writeback we will call into ->write_inode() by this
>  > > > > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
>  > > > > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
>  > > > >
>  > > > >     if (mapping_writably_mapped(upper->i_mapping) ||
>  > > > >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
>  > > > >                  iflag |= I_DIRTY_PAGES;
>  > > >
>  > > > OK, but suppose the upper mapping is clean at this moment (upper inode has
>  > > > been fully written out for whatever reason, but it is still mapped) so your
>  > > > overlayfs inode becomes clean as well. Then I don't see a mechanism which
>  > > > would make your overlayfs inode dirty again when a write to mmap happens,
>  > > > set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
>  > > >
>  > > > Note that ovl_mmap() gets called only at mmap(2) syscall time but then
>  > > > pages get faulted in, dirtied, cleaned fully at discretion of the mm
>  > > > / writeback subsystem.
>  > > >
>  > > 
>  > > Perhaps I will add some background.
>  > > 
>  > > What I suggested was to maintain a "suspect list" in addition to
>  > > the dirty ovl inodes.
>  > > 
>  > > ovl inode is added to the suspect list on mmap (writable) and removed
>  > > from the suspect list on release() flush() or on sync_fs() if real inode is no
>  > > longer writably mapped.
>  > > 
>  > > There was another variant where ovl inode is added to suspect list on open
>  > > for write and removed from suspect list on release() flush() or sync_fs()
>  > > if real inode is not inode_is_open_for_write().
>  > > 
>  > > In both cases the list will have inodes whose real is not dirty, but
>  > > in both cases
>  > > the list shouldn't be terribly large to traverse on sync_fs().
>  > > 
>  > > Chengguang tried to implement the idea without an actual list by
>  > > re-dirtying the "suspect" inodes on every write_inode(), but I personally have
>  > > no idea if his idea works.
>  > > 
>  > > I think we can resort to using an actual suspect list if you say that it
>  > > cannot work like this?
>  > 
>  > Yeah, the suspect list (i.e., additional list of inodes to check on sync)
>  > you describe should work fine. 
> 
> I think this solution still has the problem we have met in below thread[1]
> The main problem is the state combination of clean overlayfs' inode && dirty upper inode.

But I think the scheme Amir proposed and I detailed in my previous email
should prevent that state. Because while the inode is mapped, it will be
kept in the dirty list. So which scenario do you think would lead to clean
overlayfs inode and dirty upper inode?

> [1] https://www.spinics.net/lists/linux-unionfs/msg07448.html
> 
>  > Also the "keep suspect inode dirty" idea
>  > of Chengguang could work fine but we'd have to use something like
>  > inode_is_open_for_write() or inode_is_writeably_mapped() (which would need
>  > to be implemented but it should be easy vma_interval_tree_foreach() walk
>  > checking each found VMA for vma->vm_flags & VM_WRITE) for checking whether
>  > inode should be redirtied or not.
>  > 
> 
> I'm curious that isn't  it enough to check  i_mmap_writable by
> mapping_writably_mapped() ?  Am I missing something?

What is i_mmap_writeable? I've grepped the tree and didn't find anything
like that...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap
  2020-11-06  8:50               ` Jan Kara
@ 2020-11-06  9:47                 ` Chengguang Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Chengguang Xu @ 2020-11-06  9:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, miklos, linux-unionfs, linux-fsdevel, charliecgxu

 ---- 在 星期五, 2020-11-06 16:50:23 Jan Kara <jack@suse.cz> 撰写 ----
 > On Fri 06-11-20 10:41:44, Chengguang Xu wrote:
 > >  ---- 在 星期四, 2020-11-05 23:54:34 Jan Kara <jack@suse.cz> 撰写 ----
 > >  > On Thu 05-11-20 16:21:27, Amir Goldstein wrote:
 > >  > > On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
 > >  > > >
 > >  > > > On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
 > >  > > > >  ---- 在 星期二, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> 撰写 ----
 > >  > > > >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
 > >  > > > >  > > Overlayfs cannot be notified when mmapped area gets dirty,
 > >  > > > >  > > so we need to proactively mark inode dirty in ->mmap operation.
 > >  > > > >  > >
 > >  > > > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > >  > > > >  > > ---
 > >  > > > >  > >  fs/overlayfs/file.c | 4 ++++
 > >  > > > >  > >  1 file changed, 4 insertions(+)
 > >  > > > >  > >
 > >  > > > >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
 > >  > > > >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
 > >  > > > >  > > --- a/fs/overlayfs/file.c
 > >  > > > >  > > +++ b/fs/overlayfs/file.c
 > >  > > > >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 > >  > > > >  > >          /* Drop reference count from new vm_file value */
 > >  > > > >  > >          fput(realfile);
 > >  > > > >  > >      } else {
 > >  > > > >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
 > >  > > > >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
 > >  > > > >  > > +            ovl_mark_inode_dirty(file_inode(file));
 > >  > > > >  > > +
 > >  > > > >  >
 > >  > > > >  > But does this work reliably? I mean once writeback runs, your inode (as
 > >  > > > >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
 > >  > > > >  > pages again and would need flushing but overlayfs inode stays clean? Am I
 > >  > > > >  > missing something?
 > >  > > > >  >
 > >  > > > >
 > >  > > > > Yeah, this is key point of this approach, in order to  fix the issue I
 > >  > > > > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
 > >  > > > > mean is during writeback we will call into ->write_inode() by this
 > >  > > > > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
 > >  > > > > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
 > >  > > > >
 > >  > > > >     if (mapping_writably_mapped(upper->i_mapping) ||
 > >  > > > >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
 > >  > > > >                  iflag |= I_DIRTY_PAGES;
 > >  > > >
 > >  > > > OK, but suppose the upper mapping is clean at this moment (upper inode has
 > >  > > > been fully written out for whatever reason, but it is still mapped) so your
 > >  > > > overlayfs inode becomes clean as well. Then I don't see a mechanism which
 > >  > > > would make your overlayfs inode dirty again when a write to mmap happens,
 > >  > > > set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
 > >  > > >
 > >  > > > Note that ovl_mmap() gets called only at mmap(2) syscall time but then
 > >  > > > pages get faulted in, dirtied, cleaned fully at discretion of the mm
 > >  > > > / writeback subsystem.
 > >  > > >
 > >  > > 
 > >  > > Perhaps I will add some background.
 > >  > > 
 > >  > > What I suggested was to maintain a "suspect list" in addition to
 > >  > > the dirty ovl inodes.
 > >  > > 
 > >  > > ovl inode is added to the suspect list on mmap (writable) and removed
 > >  > > from the suspect list on release() flush() or on sync_fs() if real inode is no
 > >  > > longer writably mapped.
 > >  > > 
 > >  > > There was another variant where ovl inode is added to suspect list on open
 > >  > > for write and removed from suspect list on release() flush() or sync_fs()
 > >  > > if real inode is not inode_is_open_for_write().
 > >  > > 
 > >  > > In both cases the list will have inodes whose real is not dirty, but
 > >  > > in both cases
 > >  > > the list shouldn't be terribly large to traverse on sync_fs().
 > >  > > 
 > >  > > Chengguang tried to implement the idea without an actual list by
 > >  > > re-dirtying the "suspect" inodes on every write_inode(), but I personally have
 > >  > > no idea if his idea works.
 > >  > > 
 > >  > > I think we can resort to using an actual suspect list if you say that it
 > >  > > cannot work like this?
 > >  > 
 > >  > Yeah, the suspect list (i.e., additional list of inodes to check on sync)
 > >  > you describe should work fine. 
 > > 
 > > I think this solution still has the problem we have met in below thread[1]
 > > The main problem is the state combination of clean overlayfs' inode && dirty upper inode.
 > 
 > But I think the scheme Amir proposed and I detailed in my previous email
 > should prevent that state. Because while the inode is mapped, it will be
 > kept in the dirty list. So which scenario do you think would lead to clean
 > overlayfs inode and dirty upper inode?

If keeping in the dirty list means making  overlayfs inode dirty, then
I think we don't need extra list for that, vfs itself has writeback list and
the solution will be exactly the same as mine(re-dirty) . Right?


 > 
 > > [1] https://www.spinics.net/lists/linux-unionfs/msg07448.html
 > > 
 > >  > Also the "keep suspect inode dirty" idea
 > >  > of Chengguang could work fine but we'd have to use something like
 > >  > inode_is_open_for_write() or inode_is_writeably_mapped() (which would need
 > >  > to be implemented but it should be easy vma_interval_tree_foreach() walk
 > >  > checking each found VMA for vma->vm_flags & VM_WRITE) for checking whether
 > >  > inode should be redirtied or not.
 > >  > 
 > > 
 > > I'm curious that isn't  it enough to check  i_mmap_writable by
 > > mapping_writably_mapped() ?  Am I missing something?
 > 
 > What is i_mmap_writeable? I've grepped the tree and didn't find anything
 > like that...
 > 

Maybe spelling mistake? The reason I check this is I'm afraid of the permission change of vma by mprotect(2).


Thanks,
Chenguang

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, back to index

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-25  3:41 [RFC PATCH v2 0/8] implement containerized syncfs for overlayfs Chengguang Xu
2020-10-25  3:41 ` [RFC PATCH v2 1/8] ovl: setup overlayfs' private bdi Chengguang Xu
2020-10-25  3:41 ` [RFC PATCH v2 2/8] ovl: implement ->writepages operation Chengguang Xu
2020-11-02 17:17   ` Jan Kara
2020-11-04 12:18     ` Chengguang Xu
2020-11-05 13:55       ` Jan Kara
2020-11-06  5:57         ` Chengguang Xu
2020-10-25  3:41 ` [RFC PATCH v2 3/8] ovl: implement overlayfs' ->evict_inode operation Chengguang Xu
2020-10-25  3:41 ` [RFC PATCH v2 4/8] ovl: mark overlayfs' inode dirty on modification Chengguang Xu
2020-10-25  3:41 ` [RFC PATCH v2 5/8] ovl: mark overlayfs' inode dirty on shared writable mmap Chengguang Xu
2020-11-02 17:30   ` Jan Kara
2020-11-04 11:54     ` Chengguang Xu
2020-11-05 14:03       ` Jan Kara
2020-11-05 14:21         ` Amir Goldstein
2020-11-05 15:54           ` Jan Kara
2020-11-06  2:41             ` Chengguang Xu
2020-11-06  8:50               ` Jan Kara
2020-11-06  9:47                 ` Chengguang Xu
2020-10-25  3:41 ` [RFC PATCH v2 6/8] ovl: implement overlayfs' ->write_inode operation Chengguang Xu
2020-10-25  3:41 ` [RFC PATCH v2 7/8] ovl: cache dirty overlayfs' inode Chengguang Xu
2020-10-25  3:41 ` [RFC PATCH v2 8/8] ovl: implement containerized syncfs for overlayfs Chengguang Xu
2020-10-30 15:46 ` [RFC PATCH v2 0/8] " Miklos Szeredi
2020-10-31 12:22   ` Chengguang Xu

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git