* [PATCH v6] ovl: Improving syncfs efficiency
@ 2018-01-22  9:47 Chengguang Xu
  2018-01-22 11:27 ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Chengguang Xu @ 2018-01-22  9:47 UTC (permalink / raw)
  To: miklos, jack; +Cc: amir73il, linux-unionfs, Chengguang Xu

Currently, a syncfs(2) call on overlayfs simply calls sync_filesystem()
on upper_sb to synchronize all dirty inodes in the upper filesystem,
regardless of which overlay owns each inode. In the container use case,
when multiple containers share the same underlying upper filesystem,
this has the following shortcomings.

(1) Performance
Synchronization is likely to be heavy because it syncs inodes that are
unnecessary for the target overlayfs.

(2) Interference
Unplanned synchronization will likely hurt the IO performance of
unrelated container processes running on other overlayfs instances.

This patch iterates the upper inode list in overlayfs to sync only the
target dirty inodes and wait for their completion, which reduces the cost
of synchronization and avoids seriously impacting the IO performance of
unrelated processes. In the special case of an overlayfs with very few
dirty inodes but many clean upper inodes, iterating the list may waste
time and make synchronization slower than before, but we think the
potential for performance improvement far surpasses the potential for
performance regression in most cases.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
---
Changes since v5:
- Move sync related functions to new file sync.c.
- Introduce OVL_EVICT_PENDING flag to avoid race conditions.
- If ovl indoe flag is I_WILL_FREE or I_FREEING then syncfs
will wait until OVL_EVICT_PENDING to be cleared.
- Move inode sync operation into evict_inode from drop_inode.
- Call upper_sb->sync_fs with no-wait after sync dirty upper
inodes.
- Hold s_inode_wblist_lock until deleting item from list.
- Remove write_inode function because it is no longer used.

Changes since v4:
- Add syncing operation and deleting from upper_inodes list
during ovl inode destruction.
- Export symbol of inode_wait_for_writeback() for waiting
writeback on ovl inode.
- Reuse I_SYNC flag to avoid race conditions between syncfs
and drop_inode.

Changes since v3:
- Introduce upper_inode list to organize ovl indoe which has upper inode.
- Avoid iput operation inside spin lock.
- Change list iteration method for safety.
- Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
- Modify commit log and add more comments to the code.

Changes since v2:
- Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
overlayfs.
- Factoring out sync operation to different helpers for easy understanding.

Changes since v1:
- If upper filesystem is readonly mode then skip synchronization.
- Introduce global wait list to replace temporary wait list for
concurrent synchronization.

 fs/overlayfs/Makefile    |   2 +-
 fs/overlayfs/overlayfs.h |   8 ++
 fs/overlayfs/ovl_entry.h |   5 +
 fs/overlayfs/super.c     |  44 +++-----
 fs/overlayfs/sync.c      | 256 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/overlayfs/util.c      |  23 ++++-
 6 files changed, 305 insertions(+), 33 deletions(-)
 create mode 100644 fs/overlayfs/sync.c

diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
index 99373bb..6e92c0e 100644
--- a/fs/overlayfs/Makefile
+++ b/fs/overlayfs/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_OVERLAY_FS) += overlay.o
 
-overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o
+overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o sync.o
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index b489099..c6a9ecd 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -34,6 +34,8 @@ enum ovl_flag {
 	/* Non-merge dir that may contain whiteout entries */
 	OVL_WHITEOUTS,
 	OVL_INDEX,
+	/* Set when ovl inode is about to be freed */
+	OVL_EVICT_PENDING,
 };
 
 /*
@@ -241,6 +243,7 @@ int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry,
 int ovl_nlink_start(struct dentry *dentry, bool *locked);
 void ovl_nlink_end(struct dentry *dentry, bool locked);
 int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
+void ovl_detach_upper_inodes_list(struct inode *inode);
 
 static inline bool ovl_is_impuredir(struct dentry *dentry)
 {
@@ -322,3 +325,8 @@ int ovl_create_real(struct inode *dir, struct dentry *newdentry,
 int ovl_copy_xattr(struct dentry *old, struct dentry *new);
 int ovl_set_attr(struct dentry *upper, struct kstat *stat);
 struct ovl_fh *ovl_encode_fh(struct dentry *lower, bool is_upper);
+
+/* sync.c */
+int ovl_write_inode(struct inode *inode, struct writeback_control *wbc);
+void ovl_evict_inode(struct inode *inode);
+int ovl_sync_fs(struct super_block *sb, int wait);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 9d0bc03..53909ba 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -52,6 +52,9 @@ struct ovl_fs {
 	/* Did we take the inuse lock? */
 	bool upperdir_locked;
 	bool workdir_locked;
+	/* upper inode list and lock */
+	spinlock_t upper_inodes_lock;
+	struct list_head upper_inodes;
 };
 
 /* private information held for every overlayfs dentry */
@@ -80,6 +83,8 @@ struct ovl_inode {
 
 	/* synchronize copy up and more */
 	struct mutex lock;
+	/* upper inodes list */
+	struct list_head upper_inodes_list;
 };
 
 static inline struct ovl_inode *OVL_I(struct inode *inode)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 76440fe..9315155 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -195,6 +195,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
 	oi->__upperdentry = NULL;
 	oi->lower = NULL;
 	mutex_init(&oi->lock);
+	INIT_LIST_HEAD(&oi->upper_inodes_list);
 
 	return &oi->vfs_inode;
 }
@@ -252,36 +253,6 @@ static void ovl_put_super(struct super_block *sb)
 	ovl_free_fs(ofs);
 }
 
-/* Sync real dirty inodes in upper filesystem (if it exists) */
-static int ovl_sync_fs(struct super_block *sb, int wait)
-{
-	struct ovl_fs *ofs = sb->s_fs_info;
-	struct super_block *upper_sb;
-	int ret;
-
-	if (!ofs->upper_mnt)
-		return 0;
-
-	/*
-	 * If this is a sync(2) call or an emergency sync, all the super blocks
-	 * will be iterated, including upper_sb, so no need to do anything.
-	 *
-	 * If this is a syncfs(2) call, then we do need to call
-	 * sync_filesystem() on upper_sb, but enough if we do it when being
-	 * called with wait == 1.
-	 */
-	if (!wait)
-		return 0;
-
-	upper_sb = ofs->upper_mnt->mnt_sb;
-
-	down_read(&upper_sb->s_umount);
-	ret = sync_filesystem(upper_sb);
-	up_read(&upper_sb->s_umount);
-
-	return ret;
-}
-
 /**
  * ovl_statfs
  * @sb: The overlayfs super block
@@ -354,10 +325,18 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 	return 0;
 }
 
+static int ovl_drop_inode(struct inode *inode)
+{
+	if (ovl_inode_upper(inode))
+		ovl_set_flag(OVL_EVICT_PENDING, inode);
+	return 1;
+}
+
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.destroy_inode	= ovl_destroy_inode,
-	.drop_inode	= generic_delete_inode,
+	.drop_inode	= ovl_drop_inode,
+	.evict_inode	= ovl_evict_inode,
 	.put_super	= ovl_put_super,
 	.sync_fs	= ovl_sync_fs,
 	.statfs		= ovl_statfs,
@@ -1202,6 +1181,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	if (!ofs)
 		goto out;
 
+	spin_lock_init(&ofs->upper_inodes_lock);
+	INIT_LIST_HEAD(&ofs->upper_inodes);
+
 	ofs->creator_cred = cred = prepare_creds();
 	if (!cred)
 		goto out_err;
diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
new file mode 100644
index 0000000..e239b0a
--- /dev/null
+++ b/fs/overlayfs/sync.c
@@ -0,0 +1,256 @@
+/*
+ * Copyright (C) 2018 Meili Inc.
+ * Author Chengguang Xu <cgxu519@icloud.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/xattr.h>
+#include <linux/mount.h>
+#include <linux/writeback.h>
+#include <linux/blkdev.h>
+#include "overlayfs.h"
+
+void ovl_evict_inode(struct inode *inode)
+{
+	if (ovl_inode_upper(inode)) {
+		write_inode_now(ovl_inode_upper(inode), 1);
+		ovl_detach_upper_inodes_list(inode);
+
+		/*
+		 * ovl_sync_inodes() may wait until the
+		 * OVL_EVICT_PENDING flag is cleared.
+		 */
+		spin_lock(&inode->i_lock);
+		ovl_clear_flag(OVL_EVICT_PENDING, inode);
+		/* See also atomic_bitops.txt */
+		smp_mb__after_atomic();
+		wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
+		spin_unlock(&inode->i_lock);
+		clear_inode(inode);
+	}
+}
+
+/**
+ * ovl_sync_inodes
+ * @sb: The overlayfs super block
+ *
+ * The upper_inodes list organizes ovl inodes which have an upper inode;
+ * we iterate the list looking for dirty upper inodes and sync them.
+ *
+ * When we start syncing an inode, we add it to the wait list explicitly.
+ * To save memory we reuse inode->i_wb_list and sb->s_inodes_wb,
+ * so those fields have slightly different meanings in overlayfs.
+ */
+static void ovl_sync_inodes(struct super_block *sb)
+{
+	struct ovl_fs *ofs = sb->s_fs_info;
+	struct ovl_inode *oi, *oi_prev;
+	struct inode *inode, *iput_inode = NULL;
+	struct inode *upper_inode;
+	struct blk_plug plug;
+
+	struct writeback_control wbc_for_sync = {
+		.sync_mode		= WB_SYNC_ALL,
+		.for_sync		= 1,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+		.nr_to_write		= LONG_MAX,
+	};
+
+	blk_start_plug(&plug);
+	spin_lock(&ofs->upper_inodes_lock);
+	list_for_each_entry(oi, &ofs->upper_inodes, upper_inodes_list) {
+		inode = &oi->vfs_inode;
+		spin_lock(&inode->i_lock);
+
+		if (inode->i_state & I_NEW) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		/*
+		 * If ovl indoe state is I_WILL_FREE or I_FREEING,
+		 * sync operation will be done in the ovl_evict_inode(),
+		 * so wait here until OVL_EVICT_PENDING flag to be cleared.
+		 */
+		if (inode->i_state & (I_WILL_FREE | I_FREEING)) {
+			oi_prev = list_entry(oi->upper_inodes_list.prev,
+					struct ovl_inode, upper_inodes_list);
+
+			if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
+				DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
+				wait_queue_head_t *wqh;
+				bool sleep;
+
+				wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
+				prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+				sleep = ovl_test_flag(OVL_EVICT_PENDING, inode);
+				spin_unlock(&inode->i_lock);
+				spin_unlock(&ofs->upper_inodes_lock);
+				if (sleep)
+					schedule();
+				finish_wait(wqh, &wait.wq_entry);
+				/* We need to do this as 'inode' can be freed by now */
+				oi = oi_prev;
+				goto next;
+			}
+		}
+
+		ihold(inode);
+		upper_inode = ovl_inode_upper(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&ofs->upper_inodes_lock);
+
+		if (iput_inode)
+			iput(iput_inode);
+
+		if (!(upper_inode->i_state & I_DIRTY_ALL)) {
+			if (!mapping_tagged(upper_inode->i_mapping,
+						PAGECACHE_TAG_WRITEBACK)) {
+				iput_inode = inode;
+				goto next;
+			}
+		} else {
+			sync_inode(upper_inode, &wbc_for_sync);
+		}
+
+		/*
+		 * If the ovl inode is not on the sb->s_inodes_wb list, add it
+		 * to the list and keep the inode reference until IO on the
+		 * inode completes.
+		 */
+		spin_lock(&sb->s_inode_wblist_lock);
+		if (list_empty(&inode->i_wb_list)) {
+			list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
+			iput_inode = NULL;
+		} else {
+			iput_inode = inode;
+		}
+		spin_unlock(&sb->s_inode_wblist_lock);
+
+next:
+		if (need_resched()) {
+			blk_finish_plug(&plug);
+			cond_resched();
+			blk_start_plug(&plug);
+		}
+		spin_lock(&ofs->upper_inodes_lock);
+	}
+	spin_unlock(&ofs->upper_inodes_lock);
+	blk_finish_plug(&plug);
+
+	if (iput_inode)
+		iput(iput_inode);
+}
+
+/**
+ * ovl_wait_inodes
+ * @sb: The overlayfs super block
+ *
+ * Wait for writeback on the inodes in the s_inodes_wb list;
+ * all the IO that has been issued up to the time this
+ * function is entered is guaranteed to be completed.
+ */
+static void ovl_wait_inodes(struct super_block *sb)
+{
+	LIST_HEAD(sync_wait_list);
+	struct inode *inode;
+	struct inode *upper_inode;
+
+	/*
+	 * s_sync_lock is used for serialising concurrent wait operations.
+	 * s_inode_wblist_lock is used for protecting wait list against with
+	 * ovl_sync_indoes().
+	 *
+	 * An ovl inode on the sb->s_inodes_wb list already holds a reference;
+	 * this avoids the inode being silently dropped while waiting on it.
+	 *
+	 * Splice the sync wait list onto a temporary list to avoid waiting on
+	 * inodes that have started writeback after this point.
+	 */
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&sb->s_inode_wblist_lock);
+	list_splice_init(&sb->s_inodes_wb, &sync_wait_list);
+
+	while (!list_empty(&sync_wait_list)) {
+		inode = list_first_entry(&sync_wait_list, struct inode, i_wb_list);
+		list_del_init(&inode->i_wb_list);
+		upper_inode = ovl_inode_upper(inode);
+		spin_unlock(&sb->s_inode_wblist_lock);
+
+		if (!mapping_tagged(upper_inode->i_mapping,
+				PAGECACHE_TAG_WRITEBACK)) {
+			goto next;
+		}
+
+		filemap_fdatawait_keep_errors(upper_inode->i_mapping);
+
+		if (need_resched())
+			cond_resched();
+
+next:
+		iput(inode);
+		spin_lock(&sb->s_inode_wblist_lock);
+	}
+	spin_unlock(&sb->s_inode_wblist_lock);
+	mutex_unlock(&sb->s_sync_lock);
+}
+
+/**
+ * ovl_sync_filesystem
+ * @sb: The overlayfs super block
+ *
+ * Sync underlying dirty inodes in upper filesystem and
+ * wait for completion.
+ */
+static int ovl_sync_filesystem(struct super_block *sb)
+{
+	struct ovl_fs *ofs = sb->s_fs_info;
+	struct super_block *upper_sb = ofs->upper_mnt->mnt_sb;
+	int ret = 0;
+
+	down_read(&upper_sb->s_umount);
+	if (!sb_rdonly(upper_sb)) {
+
+		ovl_sync_inodes(sb);
+		/* Calling sync_fs with no-wait for better performance. */
+		if (upper_sb->s_op->sync_fs)
+			upper_sb->s_op->sync_fs(upper_sb, 0);
+
+		ovl_wait_inodes(sb);
+		if (upper_sb->s_op->sync_fs)
+			upper_sb->s_op->sync_fs(upper_sb, 1);
+
+		ret = sync_blockdev(upper_sb->s_bdev);
+	}
+	up_read(&upper_sb->s_umount);
+	return ret;
+}
+
+/**
+ * ovl_sync_fs
+ * @sb: The overlayfs super block
+ * @wait: Wait for I/O to complete
+ *
+ * Sync real dirty inodes in upper filesystem (if it exists)
+ */
+int ovl_sync_fs(struct super_block *sb, int wait)
+{
+	/*
+	 * If this is a sync(2) call or an emergency sync,
+	 * all the super blocks will be iterated, including
+	 * upper_sb, so no need to do anything.
+	 *
+	 * If this is a syncfs(2) call, then we do need to
+	 * call sync_inode() on underlying dirty upper inode,
+	 * but enough if we do it when being called with wait == 1.
+	 */
+	if (!wait)
+		return 0;
+
+	return ovl_sync_filesystem(sb);
+}
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index d6bb1c9..621d5d4 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -251,11 +251,31 @@ void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect)
 	oi->redirect = redirect;
 }
 
+void ovl_attach_upper_inodes_list(struct inode *inode)
+{
+	struct ovl_fs *ofs = inode->i_sb->s_fs_info;
+
+	spin_lock(&ofs->upper_inodes_lock);
+	list_add(&OVL_I(inode)->upper_inodes_list, &ofs->upper_inodes);
+	spin_unlock(&ofs->upper_inodes_lock);
+}
+
+void ovl_detach_upper_inodes_list(struct inode *inode)
+{
+	struct ovl_fs *ofs = inode->i_sb->s_fs_info;
+
+	spin_lock(&ofs->upper_inodes_lock);
+	list_del_init(&OVL_I(inode)->upper_inodes_list);
+	spin_unlock(&ofs->upper_inodes_lock);
+}
+
 void ovl_inode_init(struct inode *inode, struct dentry *upperdentry,
 		    struct dentry *lowerdentry)
 {
-	if (upperdentry)
+	if (upperdentry) {
 		OVL_I(inode)->__upperdentry = upperdentry;
+		ovl_attach_upper_inodes_list(inode);
+	}
 	if (lowerdentry)
 		OVL_I(inode)->lower = d_inode(lowerdentry);
 
@@ -277,6 +297,7 @@ void ovl_inode_update(struct inode *inode, struct dentry *upperdentry)
 		inode->i_private = upperinode;
 		__insert_inode_hash(inode, (unsigned long) upperinode);
 	}
+	ovl_attach_upper_inodes_list(inode);
 }
 
 void ovl_dentry_version_inc(struct dentry *dentry, bool impurity)
-- 
1.8.3.1


* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-22  9:47 [PATCH v6] ovl: Improving syncfs efficiency Chengguang Xu
@ 2018-01-22 11:27 ` Amir Goldstein
  2018-01-22 12:05   ` Chengguang Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2018-01-22 11:27 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Miklos Szeredi, Jan Kara, overlayfs

On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
> Currently, a syncfs(2) call on overlayfs simply calls sync_filesystem()
> on upper_sb to synchronize all dirty inodes in the upper filesystem,
> regardless of which overlay owns each inode. In the container use case,
> when multiple containers share the same underlying upper filesystem,
> this has the following shortcomings.
>
> (1) Performance
> Synchronization is likely to be heavy because it syncs inodes that are
> unnecessary for the target overlayfs.
>
> (2) Interference
> Unplanned synchronization will likely hurt the IO performance of
> unrelated container processes running on other overlayfs instances.
>
> This patch iterates the upper inode list in overlayfs to sync only the
> target dirty inodes and wait for their completion, which reduces the cost
> of synchronization and avoids seriously impacting the IO performance of
> unrelated processes. In the special case of an overlayfs with very few
> dirty inodes but many clean upper inodes, iterating the list may waste
> time and make synchronization slower than before, but we think the
> potential for performance improvement far surpasses the potential for
> performance regression in most cases.
>
> Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
> ---
> Changes since v5:
> - Move sync related functions to new file sync.c.
> - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
> - If ovl indoe flag is I_WILL_FREE or I_FREEING then syncfs

typo 'indo' and in comment below as well.

> will wait until OVL_EVICT_PENDING to be cleared.
> - Move inode sync operation into evict_inode from drop_inode.
> - Call upper_sb->sync_fs with no-wait after sync dirty upper
> inodes.
> - Hold s_inode_wblist_lock until deleting item from list.
> - Remove write_inode function because it is no longer used.
>
> Changes since v4:
> - Add syncing operation and deleting from upper_inodes list
> during ovl inode destruction.
> - Export symbol of inode_wait_for_writeback() for waiting
> writeback on ovl inode.
> - Reuse I_SYNC flag to avoid race conditions between syncfs
> and drop_inode.
>
> Changes since v3:
> - Introduce upper_inode list to organize ovl indoe which has upper inode.
> - Avoid iput operation inside spin lock.
> - Change list iteration method for safety.
> - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
> - Modify commit log and add more comments to the code.
>
> Changes since v2:
> - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
> overlayfs.
> - Factoring out sync operation to different helpers for easy understanding.
>
> Changes since v1:
> - If upper filesystem is readonly mode then skip synchronization.
> - Introduce global wait list to replace temporary wait list for
> concurrent synchronization.
>
>  fs/overlayfs/Makefile    |   2 +-
>  fs/overlayfs/overlayfs.h |   8 ++
>  fs/overlayfs/ovl_entry.h |   5 +
>  fs/overlayfs/super.c     |  44 +++-----
>  fs/overlayfs/sync.c      | 256 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/overlayfs/util.c      |  23 ++++-
>  6 files changed, 305 insertions(+), 33 deletions(-)
>  create mode 100644 fs/overlayfs/sync.c
>
> diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
> index 99373bb..6e92c0e 100644
> --- a/fs/overlayfs/Makefile
> +++ b/fs/overlayfs/Makefile
> @@ -4,4 +4,4 @@
>
>  obj-$(CONFIG_OVERLAY_FS) += overlay.o
>
> -overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o
> +overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o sync.o
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index b489099..c6a9ecd 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -34,6 +34,8 @@ enum ovl_flag {
>         /* Non-merge dir that may contain whiteout entries */
>         OVL_WHITEOUTS,
>         OVL_INDEX,
> +       /* Set when ovl inode is about to be freed */
> +       OVL_EVICT_PENDING,
>  };
>
>  /*
> @@ -241,6 +243,7 @@ int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry,
>  int ovl_nlink_start(struct dentry *dentry, bool *locked);
>  void ovl_nlink_end(struct dentry *dentry, bool locked);
>  int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
> +void ovl_detach_upper_inodes_list(struct inode *inode);
>
>  static inline bool ovl_is_impuredir(struct dentry *dentry)
>  {
> @@ -322,3 +325,8 @@ int ovl_create_real(struct inode *dir, struct dentry *newdentry,
>  int ovl_copy_xattr(struct dentry *old, struct dentry *new);
>  int ovl_set_attr(struct dentry *upper, struct kstat *stat);
>  struct ovl_fh *ovl_encode_fh(struct dentry *lower, bool is_upper);
> +
> +/* sync.c */
> +int ovl_write_inode(struct inode *inode, struct writeback_control *wbc);

leftover

> +void ovl_evict_inode(struct inode *inode);
> +int ovl_sync_fs(struct super_block *sb, int wait);
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index 9d0bc03..53909ba 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -52,6 +52,9 @@ struct ovl_fs {
>         /* Did we take the inuse lock? */
>         bool upperdir_locked;
>         bool workdir_locked;
> +       /* upper inode list and lock */
> +       spinlock_t upper_inodes_lock;
> +       struct list_head upper_inodes;
>  };
>
>  /* private information held for every overlayfs dentry */
> @@ -80,6 +83,8 @@ struct ovl_inode {
>
>         /* synchronize copy up and more */
>         struct mutex lock;
> +       /* upper inodes list */
> +       struct list_head upper_inodes_list;
>  };
>
>  static inline struct ovl_inode *OVL_I(struct inode *inode)
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 76440fe..9315155 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -195,6 +195,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
>         oi->__upperdentry = NULL;
>         oi->lower = NULL;
>         mutex_init(&oi->lock);
> +       INIT_LIST_HEAD(&oi->upper_inodes_list);
>
>         return &oi->vfs_inode;
>  }
> @@ -252,36 +253,6 @@ static void ovl_put_super(struct super_block *sb)
>         ovl_free_fs(ofs);
>  }
>
> -/* Sync real dirty inodes in upper filesystem (if it exists) */
> -static int ovl_sync_fs(struct super_block *sb, int wait)
> -{
> -       struct ovl_fs *ofs = sb->s_fs_info;
> -       struct super_block *upper_sb;
> -       int ret;
> -
> -       if (!ofs->upper_mnt)
> -               return 0;
> -
> -       /*
> -        * If this is a sync(2) call or an emergency sync, all the super blocks
> -        * will be iterated, including upper_sb, so no need to do anything.
> -        *
> -        * If this is a syncfs(2) call, then we do need to call
> -        * sync_filesystem() on upper_sb, but enough if we do it when being
> -        * called with wait == 1.
> -        */
> -       if (!wait)
> -               return 0;
> -
> -       upper_sb = ofs->upper_mnt->mnt_sb;
> -
> -       down_read(&upper_sb->s_umount);
> -       ret = sync_filesystem(upper_sb);
> -       up_read(&upper_sb->s_umount);
> -
> -       return ret;
> -}
> -
>  /**
>   * ovl_statfs
>   * @sb: The overlayfs super block
> @@ -354,10 +325,18 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>         return 0;
>  }
>
> +static int ovl_drop_inode(struct inode *inode)
> +{
> +       if (ovl_inode_upper(inode))
> +               ovl_set_flag(OVL_EVICT_PENDING, inode);
> +       return 1;
> +}
> +
>  static const struct super_operations ovl_super_operations = {
>         .alloc_inode    = ovl_alloc_inode,
>         .destroy_inode  = ovl_destroy_inode,
> -       .drop_inode     = generic_delete_inode,
> +       .drop_inode     = ovl_drop_inode,
> +       .evict_inode    = ovl_evict_inode,
>         .put_super      = ovl_put_super,
>         .sync_fs        = ovl_sync_fs,
>         .statfs         = ovl_statfs,
> @@ -1202,6 +1181,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>         if (!ofs)
>                 goto out;
>
> +       spin_lock_init(&ofs->upper_inodes_lock);
> +       INIT_LIST_HEAD(&ofs->upper_inodes);
> +
>         ofs->creator_cred = cred = prepare_creds();
>         if (!cred)
>                 goto out_err;
> diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
> new file mode 100644
> index 0000000..e239b0a
> --- /dev/null
> +++ b/fs/overlayfs/sync.c
> @@ -0,0 +1,256 @@
> +/*
> + * Copyright (C) 2018 Meili Inc.
> + * Author Chengguang Xu <cgxu519@icloud.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/xattr.h>
> +#include <linux/mount.h>
> +#include <linux/writeback.h>
> +#include <linux/blkdev.h>
> +#include "overlayfs.h"
> +
> +void ovl_evict_inode(struct inode *inode)
> +{
> +       if (ovl_inode_upper(inode)) {
> +               write_inode_now(ovl_inode_upper(inode), 1);
> +               ovl_detach_upper_inodes_list(inode);
> +
> +               /*
> +                * ovl_sync_inodes() may wait until the
> +                * OVL_EVICT_PENDING flag is cleared.
> +                */
> +               spin_lock(&inode->i_lock);
> +               ovl_clear_flag(OVL_EVICT_PENDING, inode);
> +               /* See also atomic_bitops.txt */
> +               smp_mb__after_atomic();
> +               wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
> +               spin_unlock(&inode->i_lock);
> +               clear_inode(inode);
> +       }
> +}
> +
> +/**
> + * ovl_sync_inodes
> + * @sb: The overlayfs super block
> + *
> + * The upper_inodes list organizes ovl inodes which have an upper inode;
> + * we iterate the list looking for dirty upper inodes and sync them.
> + *
> + * When we start syncing an inode, we add it to the wait list explicitly.
> + * To save memory we reuse inode->i_wb_list and sb->s_inodes_wb,
> + * so those fields have slightly different meanings in overlayfs.
> + */
> +static void ovl_sync_inodes(struct super_block *sb)
> +{
> +       struct ovl_fs *ofs = sb->s_fs_info;
> +       struct ovl_inode *oi, *oi_prev;
> +       struct inode *inode, *iput_inode = NULL;
> +       struct inode *upper_inode;
> +       struct blk_plug plug;
> +
> +       struct writeback_control wbc_for_sync = {
> +               .sync_mode              = WB_SYNC_ALL,
> +               .for_sync               = 1,
> +               .range_start            = 0,
> +               .range_end              = LLONG_MAX,
> +               .nr_to_write            = LONG_MAX,
> +       };
> +
> +       blk_start_plug(&plug);
> +       spin_lock(&ofs->upper_inodes_lock);
> +       list_for_each_entry(oi, &ofs->upper_inodes, upper_inodes_list) {
> +               inode = &oi->vfs_inode;
> +               spin_lock(&inode->i_lock);
> +
> +               if (inode->i_state & I_NEW) {
> +                       spin_unlock(&inode->i_lock);
> +                       continue;
> +               }
> +
> +               /*
> +                * If ovl indoe state is I_WILL_FREE or I_FREEING,
> +                * sync operation will be done in the ovl_evict_inode(),
> +                * so wait here until OVL_EVICT_PENDING flag to be cleared.
> +                */
> +               if (inode->i_state & (I_WILL_FREE | I_FREEING)) {
> +                       oi_prev = list_entry(oi->upper_inodes_list.prev,
> +                                       struct ovl_inode, upper_inodes_list);
> +
> +                       if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
> +                               DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
> +                               wait_queue_head_t *wqh;
> +                               bool sleep;
> +
> +                               wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
> +                               prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> +                               sleep = ovl_test_flag(OVL_EVICT_PENDING, inode);
> +                               spin_unlock(&inode->i_lock);
> +                               spin_unlock(&ofs->upper_inodes_lock);
> +                               if (sleep)
> +                                       schedule();
> +                               finish_wait(wqh, &wait.wq_entry);

Please move all this to helper ovl_wait_evict_pending()



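For reference, the wait sequence quoted above could be hoisted more or less verbatim into the requested helper. A sketch (the name ovl_wait_evict_pending and the locking contract — caller holds both locks, helper releases them — are assumptions, not code from this patch):

```c
/* Caller holds ofs->upper_inodes_lock and inode->i_lock;
 * both are released before sleeping.
 */
static void ovl_wait_evict_pending(struct inode *inode)
{
	struct ovl_fs *ofs = inode->i_sb->s_fs_info;
	struct ovl_inode *oi = OVL_I(inode);
	DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
	wait_queue_head_t *wqh;
	bool sleep;

	wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
	prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
	sleep = ovl_test_flag(OVL_EVICT_PENDING, inode);
	spin_unlock(&inode->i_lock);
	spin_unlock(&ofs->upper_inodes_lock);
	if (sleep)
		schedule();
	finish_wait(wqh, &wait.wq_entry);
}
```

The caller in ovl_sync_inodes() would then retake upper_inodes_lock before continuing the iteration.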
> +                               /* We need to do this as 'inode' can be freed by now */

Could oi_prev have been freed or removed from upper_inodes list?
I don't see why not.
You could try to continue iteration from iput_inode if it is not NULL,
but what if it is NULL?
Maybe instead of iput_inode = NULL when adding to inodes_wb list
get another reference to inode and always store iput_inode = inode
to use as prev in this case?

It would be good if you wrote the "life cycle" of ovl_inode in a
header comment to
this file because it has become non-trivial to follow.
An explicit mention of locking order is also very good to have in
header comment.



> +                               oi = oi_prev;
> +                               goto next;
> +                       }
> +               }
> +
> +               ihold(inode);
> +               upper_inode = ovl_inode_upper(inode);
> +               spin_unlock(&inode->i_lock);
> +               spin_unlock(&ofs->upper_inodes_lock);
> +
> +               if (iput_inode)
> +                       iput(iput_inode);
> +
> +               if (!(upper_inode->i_state & I_DIRTY_ALL)) {
> +                       if (!mapping_tagged(upper_inode->i_mapping,
> +                                               PAGECACHE_TAG_WRITEBACK)) {
> +                               iput_inode = inode;
> +                               goto next;
> +                       }
> +               } else {
> +                       sync_inode(upper_inode, &wbc_for_sync);
> +               }
> +
> +               /*
> +                * If the ovl inode is not on the sb->s_inodes_wb list, add it
> +                * to the list and keep the inode reference until IO on the
> +                * inode completes.
> +                */
> +               spin_lock(&sb->s_inode_wblist_lock);
> +               if (list_empty(&inode->i_wb_list)) {
> +                       list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> +                       iput_inode = NULL;
> +               } else {
> +                       iput_inode = inode;
> +               }
> +               spin_unlock(&sb->s_inode_wblist_lock);
> +
> +next:
> +               if (need_resched()) {
> +                       blk_finish_plug(&plug);
> +                       cond_resched();
> +                       blk_start_plug(&plug);
> +               }
> +               spin_lock(&ofs->upper_inodes_lock);
> +       }
> +       spin_unlock(&ofs->upper_inodes_lock);
> +       blk_finish_plug(&plug);
> +
> +       if (iput_inode)
> +               iput(iput_inode);
> +}
> +
> +/**
> + * ovl_wait_inodes
> + * @sb: The overlayfs super block
> + *
> + * Wait for writeback on the inodes in the s_inodes_wb list;
> + * all the IO that has been issued up to the time this
> + * function is entered is guaranteed to be completed.
> + */
> +static void ovl_wait_inodes(struct super_block *sb)
> +{
> +       LIST_HEAD(sync_wait_list);
> +       struct inode *inode;
> +       struct inode *upper_inode;
> +
> +       /*
> +        * s_sync_lock is used for serialising concurrent wait operations.
> +        * s_inode_wblist_lock protects the wait list against concurrent
> +        * ovl_sync_inodes().
> +        *
> +        * Each ovl inode on the sb->s_inodes_wb list already holds an inode
> +        * reference, which prevents the inode from being silently dropped
> +        * while we wait on it.
> +        *
> +        * Splice the sync wait list onto a temporary list to avoid waiting on
> +        * inodes that have started writeback after this point.
> +        */
> +       mutex_lock(&sb->s_sync_lock);
> +       spin_lock(&sb->s_inode_wblist_lock);
> +       list_splice_init(&sb->s_inodes_wb, &sync_wait_list);
> +
> +       while (!list_empty(&sync_wait_list)) {
> +               inode = list_first_entry(&sync_wait_list, struct inode, i_wb_list);
> +               list_del_init(&inode->i_wb_list);
> +               upper_inode = ovl_inode_upper(inode);
> +               spin_unlock(&sb->s_inode_wblist_lock);
> +
> +               if (!mapping_tagged(upper_inode->i_mapping,
> +                               PAGECACHE_TAG_WRITEBACK)) {
> +                       goto next;
> +               }
> +
> +               filemap_fdatawait_keep_errors(upper_inode->i_mapping);
> +
> +               if (need_resched())
> +                       cond_resched();
> +
> +next:
> +               iput(inode);
> +               spin_lock(&sb->s_inode_wblist_lock);
> +       }
> +       spin_unlock(&sb->s_inode_wblist_lock);
> +       mutex_unlock(&sb->s_sync_lock);
> +}
> +
> +/**
> + * ovl_sync_filesystem
> + * @sb: The overlayfs super block
> + *
> + * Sync the underlying dirty inodes in the upper filesystem and
> + * wait for completion.
> + */
> +static int ovl_sync_filesystem(struct super_block *sb)
> +{
> +       struct ovl_fs *ofs = sb->s_fs_info;
> +       struct super_block *upper_sb = ofs->upper_mnt->mnt_sb;
> +       int ret = 0;
> +
> +       down_read(&upper_sb->s_umount);
> +       if (!sb_rdonly(upper_sb)) {
> +
> +               ovl_sync_inodes(sb);
> +               /* Calling sync_fs with no-wait for better performance. */
> +               if (upper_sb->s_op->sync_fs)
> +                       upper_sb->s_op->sync_fs(upper_sb, 0);
> +
> +               ovl_wait_inodes(sb);
> +               if (upper_sb->s_op->sync_fs)
> +                       upper_sb->s_op->sync_fs(upper_sb, 1);
> +
> +               ret = sync_blockdev(upper_sb->s_bdev);
> +       }
> +       up_read(&upper_sb->s_umount);
> +       return ret;
> +}
> +
> +/**
> + * ovl_sync_fs
> + * @sb: The overlayfs super block
> + * @wait: Wait for I/O to complete
> + *
> + * Sync real dirty inodes in upper filesystem (if it exists)
> + */
> +int ovl_sync_fs(struct super_block *sb, int wait)
> +{
> +       /*
> +        * If this is a sync(2) call or an emergency sync,
> +        * all the super blocks will be iterated, including
> +        * upper_sb, so no need to do anything.
> +        *
> +        * If this is a syncfs(2) call, then we do need to call
> +        * sync_inode() on the underlying dirty upper inodes, but
> +        * it is enough to do so when called with wait == 1.
> +        */
> +       if (!wait)
> +               return 0;
> +
> +       return ovl_sync_filesystem(sb);
> +}
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index d6bb1c9..621d5d4 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -251,11 +251,31 @@ void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect)
>         oi->redirect = redirect;
>  }
>
> +void ovl_attach_upper_inodes_list(struct inode *inode)
> +{
> +       struct ovl_fs *ofs = inode->i_sb->s_fs_info;
> +
> +       spin_lock(&ofs->upper_inodes_lock);
> +       list_add(&OVL_I(inode)->upper_inodes_list, &ofs->upper_inodes);
> +       spin_unlock(&ofs->upper_inodes_lock);
> +}
> +
> +void ovl_detach_upper_inodes_list(struct inode *inode)
> +{
> +       struct ovl_fs *ofs = inode->i_sb->s_fs_info;
> +
> +       spin_lock(&ofs->upper_inodes_lock);
> +       list_del_init(&OVL_I(inode)->upper_inodes_list);
> +       spin_unlock(&ofs->upper_inodes_lock);
> +}
> +
>  void ovl_inode_init(struct inode *inode, struct dentry *upperdentry,
>                     struct dentry *lowerdentry)
>  {
> -       if (upperdentry)
> +       if (upperdentry) {
>                 OVL_I(inode)->__upperdentry = upperdentry;
> +               ovl_attach_upper_inodes_list(inode);
> +       }
>         if (lowerdentry)
>                 OVL_I(inode)->lower = d_inode(lowerdentry);
>
> @@ -277,6 +297,7 @@ void ovl_inode_update(struct inode *inode, struct dentry *upperdentry)
>                 inode->i_private = upperinode;
>                 __insert_inode_hash(inode, (unsigned long) upperinode);
>         }
> +       ovl_attach_upper_inodes_list(inode);
>  }
>
>  void ovl_dentry_version_inc(struct dentry *dentry, bool impurity)
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-22 11:27 ` Amir Goldstein
@ 2018-01-22 12:05   ` Chengguang Xu
  2018-01-22 12:58     ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Chengguang Xu @ 2018-01-22 12:05 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Jan Kara, overlayfs

> 
> On Jan 22, 2018, at 7:27 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> 
> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>> Currently syncfs(2) call on overlayfs just simply call sync_filesystem()
>> on upper_sb to synchronize whole dirty inodes in upper filesystem
>> regardless of the overlay ownership of the inode. In the use case of
>> container, when multiple containers using the same underlying upper
>> filesystem, it has some shortcomings as below.
>> 
>> (1) Performance
>> Synchronization is probably heavy because it actually syncs unnecessary
>> inodes for target overlayfs.
>> 
>> (2) Interference
>> Unplanned synchronization will probably impact IO performance of
>> unrelated container processes on the other overlayfs.
>> 
>> This patch iterates upper inode list in overlayfs to only sync target
>> dirty inodes and wait for completion. By doing this, It is able to reduce
>> cost of synchronization and will not seriously impact IO performance of
>> unrelated processes. In special case, when having very few dirty inodes
>> and a lot of clean upper inodes in overlayfs, then iteration may waste
>> time so that synchronization is slower than before, but we think the
>> potential for performance improvement far surpass the potential for
>> performance regression in most cases.
>> 
>> Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
>> ---
>> Changes since v5:
>> - Move sync related functions to new file sync.c.
>> - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
>> - If ovl indoe flag is I_WILL_FREE or I_FREEING then syncfs
> 
> typo 'indo' and in comment below as well.
> 
>> will wait until OVL_EVICT_PENDING to be cleared.
>> - Move inode sync operation into evict_inode from drop_inode.
>> - Call upper_sb->sync_fs with no-wait after sync dirty upper
>> inodes.
>> - Hold s_inode_wblist_lock until deleting item from list.
>> - Remove write_inode fuction because no more used.
>> 
>> Changes since v4:
>> - Add syncing operation and deleting from upper_inodes list
>> during ovl inode destruction.
>> - Export symbol of inode_wait_for_writeback() for waiting
>> writeback on ovl inode.
>> - Reuse I_SYNC flag to avoid race conditions between syncfs
>> and drop_inode.
>> 
>> Changes since v3:
>> - Introduce upper_inode list to organize ovl indoe which has upper inode.
>> - Avoid iput operation inside spin lock.
>> - Change list iteration method for safety.
>> - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
>> - Modify commit log and add more comments to the code.
>> 
>> Changes since v2:
>> - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
>> overlayfs.
>> - Factoring out sync operation to different helpers for easy understanding.
>> 
>> Changes since v1:
>> - If upper filesystem is readonly mode then skip synchronization.
>> - Introduce global wait list to replace temporary wait list for
>> concurrent synchronization.
>> 
>> fs/overlayfs/Makefile    |   2 +-
>> fs/overlayfs/overlayfs.h |   8 ++
>> fs/overlayfs/ovl_entry.h |   5 +
>> fs/overlayfs/super.c     |  44 +++-----
>> fs/overlayfs/sync.c      | 256 +++++++++++++++++++++++++++++++++++++++++++++++
>> fs/overlayfs/util.c      |  23 ++++-
>> 6 files changed, 305 insertions(+), 33 deletions(-)
>> create mode 100644 fs/overlayfs/sync.c
>> 
>> diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
>> index 99373bb..6e92c0e 100644
>> --- a/fs/overlayfs/Makefile
>> +++ b/fs/overlayfs/Makefile
>> @@ -4,4 +4,4 @@
>> 
>> obj-$(CONFIG_OVERLAY_FS) += overlay.o
>> 
>> -overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o
>> +overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o sync.o
>> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
>> index b489099..c6a9ecd 100644
>> --- a/fs/overlayfs/overlayfs.h
>> +++ b/fs/overlayfs/overlayfs.h
>> @@ -34,6 +34,8 @@ enum ovl_flag {
>>        /* Non-merge dir that may contain whiteout entries */
>>        OVL_WHITEOUTS,
>>        OVL_INDEX,
>> +       /* Set when ovl inode is about to be freed */
>> +       OVL_EVICT_PENDING,
>> };
>> 
>> /*
>> @@ -241,6 +243,7 @@ int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry,
>> int ovl_nlink_start(struct dentry *dentry, bool *locked);
>> void ovl_nlink_end(struct dentry *dentry, bool locked);
>> int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
>> +void ovl_detach_upper_inodes_list(struct inode *inode);
>> 
>> static inline bool ovl_is_impuredir(struct dentry *dentry)
>> {
>> @@ -322,3 +325,8 @@ int ovl_create_real(struct inode *dir, struct dentry *newdentry,
>> int ovl_copy_xattr(struct dentry *old, struct dentry *new);
>> int ovl_set_attr(struct dentry *upper, struct kstat *stat);
>> struct ovl_fh *ovl_encode_fh(struct dentry *lower, bool is_upper);
>> +
>> +/* sync.c */
>> +int ovl_write_inode(struct inode *inode, struct writeback_control *wbc);
> 
> leftover
> 
>> +void ovl_evict_inode(struct inode *inode);
>> +int ovl_sync_fs(struct super_block *sb, int wait);
>> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
>> index 9d0bc03..53909ba 100644
>> --- a/fs/overlayfs/ovl_entry.h
>> +++ b/fs/overlayfs/ovl_entry.h
>> @@ -52,6 +52,9 @@ struct ovl_fs {
>>        /* Did we take the inuse lock? */
>>        bool upperdir_locked;
>>        bool workdir_locked;
>> +       /* upper inode list and lock */
>> +       spinlock_t upper_inodes_lock;
>> +       struct list_head upper_inodes;
>> };
>> 
>> /* private information held for every overlayfs dentry */
>> @@ -80,6 +83,8 @@ struct ovl_inode {
>> 
>>        /* synchronize copy up and more */
>>        struct mutex lock;
>> +       /* upper inodes list */
>> +       struct list_head upper_inodes_list;
>> };
>> 
>> static inline struct ovl_inode *OVL_I(struct inode *inode)
>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>> index 76440fe..9315155 100644
>> --- a/fs/overlayfs/super.c
>> +++ b/fs/overlayfs/super.c
>> @@ -195,6 +195,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
>>        oi->__upperdentry = NULL;
>>        oi->lower = NULL;
>>        mutex_init(&oi->lock);
>> +       INIT_LIST_HEAD(&oi->upper_inodes_list);
>> 
>>        return &oi->vfs_inode;
>> }
>> @@ -252,36 +253,6 @@ static void ovl_put_super(struct super_block *sb)
>>        ovl_free_fs(ofs);
>> }
>> 
>> -/* Sync real dirty inodes in upper filesystem (if it exists) */
>> -static int ovl_sync_fs(struct super_block *sb, int wait)
>> -{
>> -       struct ovl_fs *ofs = sb->s_fs_info;
>> -       struct super_block *upper_sb;
>> -       int ret;
>> -
>> -       if (!ofs->upper_mnt)
>> -               return 0;
>> -
>> -       /*
>> -        * If this is a sync(2) call or an emergency sync, all the super blocks
>> -        * will be iterated, including upper_sb, so no need to do anything.
>> -        *
>> -        * If this is a syncfs(2) call, then we do need to call
>> -        * sync_filesystem() on upper_sb, but enough if we do it when being
>> -        * called with wait == 1.
>> -        */
>> -       if (!wait)
>> -               return 0;
>> -
>> -       upper_sb = ofs->upper_mnt->mnt_sb;
>> -
>> -       down_read(&upper_sb->s_umount);
>> -       ret = sync_filesystem(upper_sb);
>> -       up_read(&upper_sb->s_umount);
>> -
>> -       return ret;
>> -}
>> -
>> /**
>>  * ovl_statfs
>>  * @sb: The overlayfs super block
>> @@ -354,10 +325,18 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>>        return 0;
>> }
>> 
>> +static int ovl_drop_inode(struct inode *inode)
>> +{
>> +       if (ovl_inode_upper(inode))
>> +               ovl_set_flag(OVL_EVICT_PENDING, inode);
>> +       return 1;
>> +}
>> +
>> static const struct super_operations ovl_super_operations = {
>>        .alloc_inode    = ovl_alloc_inode,
>>        .destroy_inode  = ovl_destroy_inode,
>> -       .drop_inode     = generic_delete_inode,
>> +       .drop_inode     = ovl_drop_inode,
>> +       .evict_inode    = ovl_evict_inode,
>>        .put_super      = ovl_put_super,
>>        .sync_fs        = ovl_sync_fs,
>>        .statfs         = ovl_statfs,
>> @@ -1202,6 +1181,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>>        if (!ofs)
>>                goto out;
>> 
>> +       spin_lock_init(&ofs->upper_inodes_lock);
>> +       INIT_LIST_HEAD(&ofs->upper_inodes);
>> +
>>        ofs->creator_cred = cred = prepare_creds();
>>        if (!cred)
>>                goto out_err;
>> diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
>> new file mode 100644
>> index 0000000..e239b0a
>> --- /dev/null
>> +++ b/fs/overlayfs/sync.c
>> @@ -0,0 +1,256 @@
>> +/*
>> + * Copyright (C) 2018 Meili Inc.
>> + * Author Chengguang Xu <cgxu519@icloud.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/fs.h>
>> +#include <linux/xattr.h>
>> +#include <linux/mount.h>
>> +#include <linux/writeback.h>
>> +#include <linux/blkdev.h>
>> +#include "overlayfs.h"
>> +
>> +void ovl_evict_inode(struct inode *inode)
>> +{
>> +       if (ovl_inode_upper(inode)) {
>> +               write_inode_now(ovl_inode_upper(inode), 1);
>> +               ovl_detach_upper_inodes_list(inode);
>> +
>> +               /*
>> +                * ovl_sync_inodes() may wait until
>> +                * flag OVL_EVICT_PENDING to be cleared.
>> +                */
>> +               spin_lock(&inode->i_lock);
>> +               ovl_clear_flag(OVL_EVICT_PENDING, inode);
>> +               /* See also atomic_bitops.txt */
>> +               smp_mb__after_atomic();
>> +               wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
>> +               spin_unlock(&inode->i_lock);
>> +               clear_inode(inode);
>> +       }
>> +}
>> +
>> +/**
>> + * ovl_sync_inodes
>> + * @sb: The overlayfs super block
>> + *
>> + * upper_inodes list is used for organizing ovl inode which has upper inode,
>> + * by iterating the list to looking for and syncing dirty upper inode.
>> + *
>> + * When starting syncing inode, we add the inode to wait list explicitly,
>> + * in order to save memory we reuse inode->i_wb_list and sb->s_inodes_wb,
>> + * so those fields have slightly differnt meanings in overlayfs.
>> + */
>> +static void ovl_sync_inodes(struct super_block *sb)
>> +{
>> +       struct ovl_fs *ofs = sb->s_fs_info;
>> +       struct ovl_inode *oi, *oi_prev;
>> +       struct inode *inode, *iput_inode = NULL;
>> +       struct inode *upper_inode;
>> +       struct blk_plug plug;
>> +
>> +       struct writeback_control wbc_for_sync = {
>> +               .sync_mode              = WB_SYNC_ALL,
>> +               .for_sync               = 1,
>> +               .range_start            = 0,
>> +               .range_end              = LLONG_MAX,
>> +               .nr_to_write            = LONG_MAX,
>> +       };
>> +
>> +       blk_start_plug(&plug);
>> +       spin_lock(&ofs->upper_inodes_lock);
>> +       list_for_each_entry(oi, &ofs->upper_inodes, upper_inodes_list) {
>> +               inode = &oi->vfs_inode;
>> +               spin_lock(&inode->i_lock);
>> +
>> +               if (inode->i_state & I_NEW) {
>> +                       spin_unlock(&inode->i_lock);
>> +                       continue;
>> +               }
>> +
>> +               /*
>> +                * If ovl indoe state is I_WILL_FREE or I_FREEING,
>> +                * sync operation will be done in the ovl_evict_inode(),
>> +                * so wait here until OVL_EVICT_PENDING flag to be cleared.
>> +                */
>> +               if (inode->i_state & (I_WILL_FREE | I_FREEING)) {
>> +                       oi_prev = list_entry(oi->upper_inodes_list.prev,
>> +                                       struct ovl_inode, upper_inodes_list);
>> +
>> +                       if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
>> +                               DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
>> +                               wait_queue_head_t *wqh;
>> +                               bool sleep;
>> +
>> +                               wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
>> +                               prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
>> +                               sleep = ovl_test_flag(OVL_EVICT_PENDING, inode);
>> +                               spin_unlock(&inode->i_lock);
>> +                               spin_unlock(&ofs->upper_inodes_lock);
>> +                               if (sleep)
>> +                                       schedule();
>> +                               finish_wait(wqh, &wait.wq_entry);
> 
> Please move all this to helper ovl_wait_evict_pending()
> 
> 
> 
>> +                               /* We need to do this as 'inode' can be freed by now */
> 
> Could oi_prev have been freed or removed from upper_inodes list?
> I don't see why not.
> You could try to continue iteration from iput_inode if it is not NULL,
> but what if it is NULL?

When iterating the list we either hold a reference or wait until the item is
deleted, so I think there is no chance for the previous inode to be freed.
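
To illustrate the invariant I'm relying on, here is a minimal userspace sketch (hypothetical names; plain refcounts stand in for iget/iput and no locking is modelled): while the walk is parked on a node it holds a reference, so that node stays linked and remains a valid resume cursor even if another node on the list is evicted in the meantime.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	int refcount;			/* node is freed when this hits 0 */
	int id;
	struct node *prev, *next;
};

static struct node head = { 1, -1, &head, &head };
static struct node *victim;		/* node "evicted" mid-walk */

static struct node *node_new(int id)
{
	struct node *n = malloc(sizeof(*n));

	n->refcount = 1;
	n->id = id;
	n->prev = head.prev;
	n->next = &head;
	head.prev->next = n;
	head.prev = n;
	return n;
}

static void node_get(struct node *n) { n->refcount++; }

/* Drop a reference; unlink and free on the last put (cf. eviction). */
static void node_put(struct node *n)
{
	if (--n->refcount == 0) {
		n->prev->next = n->next;
		n->next->prev = n->prev;
		free(n);
	}
}

/* Walk a 3-node list; node 2 is evicted while we are parked on node 1. */
static int demo(void)
{
	struct node *cur;
	int visited = 0;

	node_new(1);
	victim = node_new(2);
	node_new(3);

	cur = head.next;
	while (cur != &head) {
		struct node *next;

		node_get(cur);			/* pin the cursor */
		if (victim && cur->id == 1) {
			node_put(victim);	/* "concurrent" eviction */
			victim = NULL;
		}
		visited++;
		next = cur->next;	/* safe: cur pinned, list repaired */
		node_put(cur);
		cur = next;
	}
	return visited;			/* nodes 1 and 3 */
}
```

This only models the invariant as I stated it; whether oi_prev itself is always covered by such a reference is the open question in this thread.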


> Maybe instead of iput_inode = NULL when adding to inodes_wb list
> get another reference to inode and always store iput_inode = inode
> to use as prev in this case?

Maybe it’s better than the current code, but I think it makes the behavior
more complex to understand, so let me wait for more comments on this approach.
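
For reference, my reading of the suggestion, as rough kernel-style pseudocode (hypothetical, not part of the patch): always pin the resume anchor with an explicit extra reference instead of relying on the wb_list reference.

```
list_for_each_entry(oi, &ofs->upper_inodes, upper_inodes_list) {
	inode = &oi->vfs_inode;
	...
	ihold(inode);			/* extra ref: always a valid anchor */
	if (iput_inode)
		iput(iput_inode);	/* release the previous anchor */
	iput_inode = inode;		/* safe resume point next round */
	...
}
```

(The ihold() would of course need the usual i_lock/i_count care; this is only a sketch of the reference flow.)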


> 
> It would be good if you wrote the "life cycle" of ovl_inode in a
> header comment to
> this file because it has become non-trivial to follow.
> An explicit mention of locking order is also very good to have in
> header comment.
> 
> 

Let me add that in the next version.

Speaking of the life cycle of ovl_inode: currently I only mark OVL_EVICT_PENDING
on an ovl_inode which has an upper inode. Is that acceptable?

Thanks,
Chengguang.


* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-22 12:05   ` Chengguang Xu
@ 2018-01-22 12:58     ` Amir Goldstein
  2018-01-22 16:00       ` Chengguang Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2018-01-22 12:58 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Miklos Szeredi, Jan Kara, overlayfs

On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>
>> On Jan 22, 2018, at 7:27 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>> Currently syncfs(2) call on overlayfs just simply call sync_filesystem()
>>> on upper_sb to synchronize whole dirty inodes in upper filesystem
>>> regardless of the overlay ownership of the inode. In the use case of
>>> container, when multiple containers using the same underlying upper
>>> filesystem, it has some shortcomings as below.
>>>
>>> (1) Performance
>>> Synchronization is probably heavy because it actually syncs unnecessary
>>> inodes for target overlayfs.
>>>
>>> (2) Interference
>>> Unplanned synchronization will probably impact IO performance of
>>> unrelated container processes on the other overlayfs.
>>>
>>> This patch iterates upper inode list in overlayfs to only sync target
>>> dirty inodes and wait for completion. By doing this, It is able to reduce
>>> cost of synchronization and will not seriously impact IO performance of
>>> unrelated processes. In special case, when having very few dirty inodes
>>> and a lot of clean upper inodes in overlayfs, then iteration may waste
>>> time so that synchronization is slower than before, but we think the
>>> potential for performance improvement far surpass the potential for
>>> performance regression in most cases.
>>>
>>> Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
>>> ---
>>> Changes since v5:
>>> - Move sync related functions to new file sync.c.
>>> - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
>>> - If ovl indoe flag is I_WILL_FREE or I_FREEING then syncfs
>>
>> typo 'indo' and in comment below as well.
>>
>>> will wait until OVL_EVICT_PENDING to be cleared.
>>> - Move inode sync operation into evict_inode from drop_inode.
>>> - Call upper_sb->sync_fs with no-wait after sync dirty upper
>>> inodes.
>>> - Hold s_inode_wblist_lock until deleting item from list.
>>> - Remove write_inode fuction because no more used.
>>>
>>> Changes since v4:
>>> - Add syncing operation and deleting from upper_inodes list
>>> during ovl inode destruction.
>>> - Export symbol of inode_wait_for_writeback() for waiting
>>> writeback on ovl inode.
>>> - Reuse I_SYNC flag to avoid race conditions between syncfs
>>> and drop_inode.
>>>
>>> Changes since v3:
>>> - Introduce upper_inode list to organize ovl indoe which has upper inode.
>>> - Avoid iput operation inside spin lock.
>>> - Change list iteration method for safety.
>>> - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
>>> - Modify commit log and add more comments to the code.
>>>
>>> Changes since v2:
>>> - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
>>> overlayfs.
>>> - Factoring out sync operation to different helpers for easy understanding.
>>>
>>> Changes since v1:
>>> - If upper filesystem is readonly mode then skip synchronization.
>>> - Introduce global wait list to replace temporary wait list for
>>> concurrent synchronization.
>>>
>>> fs/overlayfs/Makefile    |   2 +-
>>> fs/overlayfs/overlayfs.h |   8 ++
>>> fs/overlayfs/ovl_entry.h |   5 +
>>> fs/overlayfs/super.c     |  44 +++-----
>>> fs/overlayfs/sync.c      | 256 +++++++++++++++++++++++++++++++++++++++++++++++
>>> fs/overlayfs/util.c      |  23 ++++-
>>> 6 files changed, 305 insertions(+), 33 deletions(-)
>>> create mode 100644 fs/overlayfs/sync.c
>>>
>>> diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
>>> index 99373bb..6e92c0e 100644
>>> --- a/fs/overlayfs/Makefile
>>> +++ b/fs/overlayfs/Makefile
>>> @@ -4,4 +4,4 @@
>>>
>>> obj-$(CONFIG_OVERLAY_FS) += overlay.o
>>>
>>> -overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o
>>> +overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o sync.o
>>> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
>>> index b489099..c6a9ecd 100644
>>> --- a/fs/overlayfs/overlayfs.h
>>> +++ b/fs/overlayfs/overlayfs.h
>>> @@ -34,6 +34,8 @@ enum ovl_flag {
>>>        /* Non-merge dir that may contain whiteout entries */
>>>        OVL_WHITEOUTS,
>>>        OVL_INDEX,
>>> +       /* Set when ovl inode is about to be freed */
>>> +       OVL_EVICT_PENDING,
>>> };
>>>
>>> /*
>>> @@ -241,6 +243,7 @@ int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry,
>>> int ovl_nlink_start(struct dentry *dentry, bool *locked);
>>> void ovl_nlink_end(struct dentry *dentry, bool locked);
>>> int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
>>> +void ovl_detach_upper_inodes_list(struct inode *inode);
>>>
>>> static inline bool ovl_is_impuredir(struct dentry *dentry)
>>> {
>>> @@ -322,3 +325,8 @@ int ovl_create_real(struct inode *dir, struct dentry *newdentry,
>>> int ovl_copy_xattr(struct dentry *old, struct dentry *new);
>>> int ovl_set_attr(struct dentry *upper, struct kstat *stat);
>>> struct ovl_fh *ovl_encode_fh(struct dentry *lower, bool is_upper);
>>> +
>>> +/* sync.c */
>>> +int ovl_write_inode(struct inode *inode, struct writeback_control *wbc);
>>
>> leftover
>>
>>> +void ovl_evict_inode(struct inode *inode);
>>> +int ovl_sync_fs(struct super_block *sb, int wait);
>>> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
>>> index 9d0bc03..53909ba 100644
>>> --- a/fs/overlayfs/ovl_entry.h
>>> +++ b/fs/overlayfs/ovl_entry.h
>>> @@ -52,6 +52,9 @@ struct ovl_fs {
>>>        /* Did we take the inuse lock? */
>>>        bool upperdir_locked;
>>>        bool workdir_locked;
>>> +       /* upper inode list and lock */
>>> +       spinlock_t upper_inodes_lock;
>>> +       struct list_head upper_inodes;
>>> };
>>>
>>> /* private information held for every overlayfs dentry */
>>> @@ -80,6 +83,8 @@ struct ovl_inode {
>>>
>>>        /* synchronize copy up and more */
>>>        struct mutex lock;
>>> +       /* upper inodes list */
>>> +       struct list_head upper_inodes_list;
>>> };
>>>
>>> static inline struct ovl_inode *OVL_I(struct inode *inode)
>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>> index 76440fe..9315155 100644
>>> --- a/fs/overlayfs/super.c
>>> +++ b/fs/overlayfs/super.c
>>> @@ -195,6 +195,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
>>>        oi->__upperdentry = NULL;
>>>        oi->lower = NULL;
>>>        mutex_init(&oi->lock);
>>> +       INIT_LIST_HEAD(&oi->upper_inodes_list);
>>>
>>>        return &oi->vfs_inode;
>>> }
>>> @@ -252,36 +253,6 @@ static void ovl_put_super(struct super_block *sb)
>>>        ovl_free_fs(ofs);
>>> }
>>>
>>> -/* Sync real dirty inodes in upper filesystem (if it exists) */
>>> -static int ovl_sync_fs(struct super_block *sb, int wait)
>>> -{
>>> -       struct ovl_fs *ofs = sb->s_fs_info;
>>> -       struct super_block *upper_sb;
>>> -       int ret;
>>> -
>>> -       if (!ofs->upper_mnt)
>>> -               return 0;
>>> -
>>> -       /*
>>> -        * If this is a sync(2) call or an emergency sync, all the super blocks
>>> -        * will be iterated, including upper_sb, so no need to do anything.
>>> -        *
>>> -        * If this is a syncfs(2) call, then we do need to call
>>> -        * sync_filesystem() on upper_sb, but enough if we do it when being
>>> -        * called with wait == 1.
>>> -        */
>>> -       if (!wait)
>>> -               return 0;
>>> -
>>> -       upper_sb = ofs->upper_mnt->mnt_sb;
>>> -
>>> -       down_read(&upper_sb->s_umount);
>>> -       ret = sync_filesystem(upper_sb);
>>> -       up_read(&upper_sb->s_umount);
>>> -
>>> -       return ret;
>>> -}
>>> -
>>> /**
>>>  * ovl_statfs
>>>  * @sb: The overlayfs super block
>>> @@ -354,10 +325,18 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>>>        return 0;
>>> }
>>>
>>> +static int ovl_drop_inode(struct inode *inode)
>>> +{
>>> +       if (ovl_inode_upper(inode))
>>> +               ovl_set_flag(OVL_EVICT_PENDING, inode);
>>> +       return 1;
>>> +}
>>> +
>>> static const struct super_operations ovl_super_operations = {
>>>        .alloc_inode    = ovl_alloc_inode,
>>>        .destroy_inode  = ovl_destroy_inode,
>>> -       .drop_inode     = generic_delete_inode,
>>> +       .drop_inode     = ovl_drop_inode,
>>> +       .evict_inode    = ovl_evict_inode,
>>>        .put_super      = ovl_put_super,
>>>        .sync_fs        = ovl_sync_fs,
>>>        .statfs         = ovl_statfs,
>>> @@ -1202,6 +1181,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>>>        if (!ofs)
>>>                goto out;
>>>
>>> +       spin_lock_init(&ofs->upper_inodes_lock);
>>> +       INIT_LIST_HEAD(&ofs->upper_inodes);
>>> +
>>>        ofs->creator_cred = cred = prepare_creds();
>>>        if (!cred)
>>>                goto out_err;
>>> diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
>>> new file mode 100644
>>> index 0000000..e239b0a
>>> --- /dev/null
>>> +++ b/fs/overlayfs/sync.c
>>> @@ -0,0 +1,256 @@
>>> +/*
>>> + * Copyright (C) 2018 Meili Inc.
>>> + * Author Chengguang Xu <cgxu519@icloud.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify it
>>> + * under the terms of the GNU General Public License version 2 as published by
>>> + * the Free Software Foundation.
>>> + */
>>> +
>>> +#include <linux/fs.h>
>>> +#include <linux/xattr.h>
>>> +#include <linux/mount.h>
>>> +#include <linux/writeback.h>
>>> +#include <linux/blkdev.h>
>>> +#include "overlayfs.h"
>>> +
>>> +void ovl_evict_inode(struct inode *inode)
>>> +{
>>> +       if (ovl_inode_upper(inode)) {
>>> +               write_inode_now(ovl_inode_upper(inode), 1);
>>> +               ovl_detach_upper_inodes_list(inode);
>>> +
>>> +               /*
>>> +                * ovl_sync_inodes() may wait until
>>> +                * flag OVL_EVICT_PENDING to be cleared.
>>> +                */
>>> +               spin_lock(&inode->i_lock);
>>> +               ovl_clear_flag(OVL_EVICT_PENDING, inode);
>>> +               /* See also atomic_bitops.txt */
>>> +               smp_mb__after_atomic();
>>> +               wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
>>> +               spin_unlock(&inode->i_lock);
>>> +               clear_inode(inode);
>>> +       }
>>> +}
>>> +
>>> +/**
>>> + * ovl_sync_inodes
>>> + * @sb: The overlayfs super block
>>> + *
>>> + * upper_inodes list is used for organizing ovl inode which has upper inode,
>>> + * by iterating the list to looking for and syncing dirty upper inode.
>>> + *
>>> + * When starting syncing inode, we add the inode to wait list explicitly,
>>> + * in order to save memory we reuse inode->i_wb_list and sb->s_inodes_wb,
>>> + * so those fields have slightly differnt meanings in overlayfs.
>>> + */
>>> +static void ovl_sync_inodes(struct super_block *sb)
>>> +{
>>> +       struct ovl_fs *ofs = sb->s_fs_info;
>>> +       struct ovl_inode *oi, *oi_prev;
>>> +       struct inode *inode, *iput_inode = NULL;
>>> +       struct inode *upper_inode;
>>> +       struct blk_plug plug;
>>> +
>>> +       struct writeback_control wbc_for_sync = {
>>> +               .sync_mode              = WB_SYNC_ALL,
>>> +               .for_sync               = 1,
>>> +               .range_start            = 0,
>>> +               .range_end              = LLONG_MAX,
>>> +               .nr_to_write            = LONG_MAX,
>>> +       };
>>> +
>>> +       blk_start_plug(&plug);
>>> +       spin_lock(&ofs->upper_inodes_lock);
>>> +       list_for_each_entry(oi, &ofs->upper_inodes, upper_inodes_list) {
>>> +               inode = &oi->vfs_inode;
>>> +               spin_lock(&inode->i_lock);
>>> +
>>> +               if (inode->i_state & I_NEW) {
>>> +                       spin_unlock(&inode->i_lock);
>>> +                       continue;
>>> +               }
>>> +
>>> +               /*
>>> +                * If ovl indoe state is I_WILL_FREE or I_FREEING,
>>> +                * sync operation will be done in the ovl_evict_inode(),
>>> +                * so wait here until OVL_EVICT_PENDING flag to be cleared.
>>> +                */
>>> +               if (inode->i_state & (I_WILL_FREE | I_FREEING)) {
>>> +                       oi_prev = list_entry(oi->upper_inodes_list.prev,
>>> +                                       struct ovl_inode, upper_inodes_list);
>>> +
>>> +                       if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
>>> +                               DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
>>> +                               wait_queue_head_t *wqh;
>>> +                               bool sleep;
>>> +
>>> +                               wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
>>> +                               prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
>>> +                               sleep = ovl_test_flag(OVL_EVICT_PENDING, inode);
>>> +                               spin_unlock(&inode->i_lock);
>>> +                               spin_unlock(&ofs->upper_inodes_lock);
>>> +                               if (sleep)
>>> +                                       schedule();
>>> +                               finish_wait(wqh, &wait.wq_entry);
>>
>> Please move all this to helper ovl_wait_evict_pending()
>>
>>
>>
>>> +                               /* We need to do this as 'inode' can be freed by now */
>>
>> Could oi_prev have been freed or removed from upper_inodes list?
>> I don't see why not.
>> You could try to continue iteration from iput_inode if it is not NULL,
>> but what if it is NULL?
>
> When iterating the list we either take a reference or wait until the item gets deleted,
> so I think there is no chance that the previous inode becomes NULL.
>

Maybe I am missing something.
When iterating the upper_inodes list you get a reference on current (oi)
and maybe have a reference on previous via iput_inode.
previous may have a reference from wb_list, but that reference can be lost
while we are not holding the wblist lock. No?

>
>> Maybe instead of iput_inode = NULL when adding to inodes_wb list
>> get another reference to inode and always store iput_inode = inode
>> to use as prev in this case?
>
> Maybe it’s better than the current code, but I think it makes the behavior more complex
> to understand, so please let me wait for more comments on the approach.
>

Sure, let's wait for other comments, but I can't say I understand what is
complex about the change I propose. If anything it is simpler
to understand, especially if you rename iput_inode to prev_inode:

+               /*
+                * If ovl inode is not in sb->s_inodes_wb list, add to the
+                * list and increment the inode reference until IO completes
+                * on the inode.
+                */
+               spin_lock(&sb->s_inode_wblist_lock);
+               if (list_empty(&inode->i_wb_list)) {
+                       list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
+                       ihold(inode);
+               }
+               iput_inode = inode;

>
>>
>> It would be good if you wrote the "life cycle" of ovl_inode in a
>> header comment to
>> this file because it has become non-trivial to follow.
>> An explicit mention of locking order is also very good to have in
>> header comment.
>>
>>
>
> Let me add in next version.
>
> Talking about the life cycle of ovl_inode, currently I only mark OVL_EVICT_PENDING
> for an ovl_inode which has an upper inode, is that acceptable?
>

Until your change, ovl_inode life cycle was 'generic' when refcount
dropped to 0.
Now, inode can be in OVL_EVICT_PENDING state, it can be waiting for sync
on or off upper_inodes list. I'm just saying it would be nice to
document all the
possible states of an ovl_inode during eviction and document all the locking
and refcount guarantees in one place to see that we did not miss anything.

The volume of this documentation is entirely up to you.
I think it will be good for you as well to tell the whole story from
start to end
and verify there are no holes in the plot.

Cheers,
Amir.


* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-22 12:58     ` Amir Goldstein
@ 2018-01-22 16:00       ` Chengguang Xu
  2018-01-22 18:45         ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Chengguang Xu @ 2018-01-22 16:00 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Jan Kara, overlayfs


> 在 2018年1月22日,下午8:58,Amir Goldstein <amir73il@gmail.com> 写道:
> 
> On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>> 
>>> 在 2018年1月22日,下午7:27,Amir Goldstein <amir73il@gmail.com> 写道:
>>> 
>>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>> Currently syncfs(2) call on overlayfs just simply call sync_filesystem()
>>>> on upper_sb to synchronize whole dirty inodes in upper filesystem
>>>> regardless of the overlay ownership of the inode. In the use case of
>>>> container, when multiple containers using the same underlying upper
>>>> filesystem, it has some shortcomings as below.
>>>> 
>>>> (1) Performance
>>>> Synchronization is probably heavy because it actually syncs unnecessary
>>>> inodes for target overlayfs.
>>>> 
>>>> (2) Interference
>>>> Unplanned synchronization will probably impact IO performance of
>>>> unrelated container processes on the other overlayfs.
>>>> 
>>>> This patch iterates upper inode list in overlayfs to only sync target
>>>> dirty inodes and wait for completion. By doing this, It is able to reduce
>>>> cost of synchronization and will not seriously impact IO performance of
>>>> unrelated processes. In special case, when having very few dirty inodes
>>>> and a lot of clean upper inodes in overlayfs, then iteration may waste
>>>> time so that synchronization is slower than before, but we think the
>>>> potential for performance improvement far surpass the potential for
>>>> performance regression in most cases.
>>>> 
>>>> Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
>>>> ---
>>>> Changes since v5:
>>>> - Move sync related functions to new file sync.c.
>>>> - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
>>>> - If ovl indoe flag is I_WILL_FREE or I_FREEING then syncfs
>>> 
>>> typo 'indo' and in comment below as well.
>>> 
>>>> will wait until OVL_EVICT_PENDING to be cleared.
>>>> - Move inode sync operation into evict_inode from drop_inode.
>>>> - Call upper_sb->sync_fs with no-wait after sync dirty upper
>>>> inodes.
>>>> - Hold s_inode_wblist_lock until deleting item from list.
>>>> - Remove write_inode function because no more used.
>>>> 
>>>> Changes since v4:
>>>> - Add syncing operation and deleting from upper_inodes list
>>>> during ovl inode destruction.
>>>> - Export symbol of inode_wait_for_writeback() for waiting
>>>> writeback on ovl inode.
>>>> - Reuse I_SYNC flag to avoid race conditions between syncfs
>>>> and drop_inode.
>>>> 
>>>> Changes since v3:
>>>> - Introduce upper_inode list to organize ovl indoe which has upper inode.
>>>> - Avoid iput operation inside spin lock.
>>>> - Change list iteration method for safety.
>>>> - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
>>>> - Modify commit log and add more comments to the code.
>>>> 
>>>> Changes since v2:
>>>> - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
>>>> overlayfs.
>>>> - Factoring out sync operation to different helpers for easy understanding.
>>>> 
>>>> Changes since v1:
>>>> - If upper filesystem is readonly mode then skip synchronization.
>>>> - Introduce global wait list to replace temporary wait list for
>>>> concurrent synchronization.
>>>> 
>>>> fs/overlayfs/Makefile    |   2 +-
>>>> fs/overlayfs/overlayfs.h |   8 ++
>>>> fs/overlayfs/ovl_entry.h |   5 +
>>>> fs/overlayfs/super.c     |  44 +++-----
>>>> fs/overlayfs/sync.c      | 256 +++++++++++++++++++++++++++++++++++++++++++++++
>>>> fs/overlayfs/util.c      |  23 ++++-
>>>> 6 files changed, 305 insertions(+), 33 deletions(-)
>>>> create mode 100644 fs/overlayfs/sync.c
>>>> 
>>>> diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
>>>> index 99373bb..6e92c0e 100644
>>>> --- a/fs/overlayfs/Makefile
>>>> +++ b/fs/overlayfs/Makefile
>>>> @@ -4,4 +4,4 @@
>>>> 
>>>> obj-$(CONFIG_OVERLAY_FS) += overlay.o
>>>> 
>>>> -overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o
>>>> +overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o sync.o
>>>> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
>>>> index b489099..c6a9ecd 100644
>>>> --- a/fs/overlayfs/overlayfs.h
>>>> +++ b/fs/overlayfs/overlayfs.h
>>>> @@ -34,6 +34,8 @@ enum ovl_flag {
>>>>       /* Non-merge dir that may contain whiteout entries */
>>>>       OVL_WHITEOUTS,
>>>>       OVL_INDEX,
>>>> +       /* Set when ovl inode is about to be freed */
>>>> +       OVL_EVICT_PENDING,
>>>> };
>>>> 
>>>> /*
>>>> @@ -241,6 +243,7 @@ int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry,
>>>> int ovl_nlink_start(struct dentry *dentry, bool *locked);
>>>> void ovl_nlink_end(struct dentry *dentry, bool locked);
>>>> int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
>>>> +void ovl_detach_upper_inodes_list(struct inode *inode);
>>>> 
>>>> static inline bool ovl_is_impuredir(struct dentry *dentry)
>>>> {
>>>> @@ -322,3 +325,8 @@ int ovl_create_real(struct inode *dir, struct dentry *newdentry,
>>>> int ovl_copy_xattr(struct dentry *old, struct dentry *new);
>>>> int ovl_set_attr(struct dentry *upper, struct kstat *stat);
>>>> struct ovl_fh *ovl_encode_fh(struct dentry *lower, bool is_upper);
>>>> +
>>>> +/* sync.c */
>>>> +int ovl_write_inode(struct inode *inode, struct writeback_control *wbc);
>>> 
>>> leftover
>>> 
>>>> +void ovl_evict_inode(struct inode *inode);
>>>> +int ovl_sync_fs(struct super_block *sb, int wait);
>>>> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
>>>> index 9d0bc03..53909ba 100644
>>>> --- a/fs/overlayfs/ovl_entry.h
>>>> +++ b/fs/overlayfs/ovl_entry.h
>>>> @@ -52,6 +52,9 @@ struct ovl_fs {
>>>>       /* Did we take the inuse lock? */
>>>>       bool upperdir_locked;
>>>>       bool workdir_locked;
>>>> +       /* upper inode list and lock */
>>>> +       spinlock_t upper_inodes_lock;
>>>> +       struct list_head upper_inodes;
>>>> };
>>>> 
>>>> /* private information held for every overlayfs dentry */
>>>> @@ -80,6 +83,8 @@ struct ovl_inode {
>>>> 
>>>>       /* synchronize copy up and more */
>>>>       struct mutex lock;
>>>> +       /* upper inodes list */
>>>> +       struct list_head upper_inodes_list;
>>>> };
>>>> 
>>>> static inline struct ovl_inode *OVL_I(struct inode *inode)
>>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>>> index 76440fe..9315155 100644
>>>> --- a/fs/overlayfs/super.c
>>>> +++ b/fs/overlayfs/super.c
>>>> @@ -195,6 +195,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
>>>>       oi->__upperdentry = NULL;
>>>>       oi->lower = NULL;
>>>>       mutex_init(&oi->lock);
>>>> +       INIT_LIST_HEAD(&oi->upper_inodes_list);
>>>> 
>>>>       return &oi->vfs_inode;
>>>> }
>>>> @@ -252,36 +253,6 @@ static void ovl_put_super(struct super_block *sb)
>>>>       ovl_free_fs(ofs);
>>>> }
>>>> 
>>>> -/* Sync real dirty inodes in upper filesystem (if it exists) */
>>>> -static int ovl_sync_fs(struct super_block *sb, int wait)
>>>> -{
>>>> -       struct ovl_fs *ofs = sb->s_fs_info;
>>>> -       struct super_block *upper_sb;
>>>> -       int ret;
>>>> -
>>>> -       if (!ofs->upper_mnt)
>>>> -               return 0;
>>>> -
>>>> -       /*
>>>> -        * If this is a sync(2) call or an emergency sync, all the super blocks
>>>> -        * will be iterated, including upper_sb, so no need to do anything.
>>>> -        *
>>>> -        * If this is a syncfs(2) call, then we do need to call
>>>> -        * sync_filesystem() on upper_sb, but enough if we do it when being
>>>> -        * called with wait == 1.
>>>> -        */
>>>> -       if (!wait)
>>>> -               return 0;
>>>> -
>>>> -       upper_sb = ofs->upper_mnt->mnt_sb;
>>>> -
>>>> -       down_read(&upper_sb->s_umount);
>>>> -       ret = sync_filesystem(upper_sb);
>>>> -       up_read(&upper_sb->s_umount);
>>>> -
>>>> -       return ret;
>>>> -}
>>>> -
>>>> /**
>>>> * ovl_statfs
>>>> * @sb: The overlayfs super block
>>>> @@ -354,10 +325,18 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>>>>       return 0;
>>>> }
>>>> 
>>>> +static int ovl_drop_inode(struct inode *inode)
>>>> +{
>>>> +       if (ovl_inode_upper(inode))
>>>> +               ovl_set_flag(OVL_EVICT_PENDING, inode);
>>>> +       return 1;
>>>> +}
>>>> +
>>>> static const struct super_operations ovl_super_operations = {
>>>>       .alloc_inode    = ovl_alloc_inode,
>>>>       .destroy_inode  = ovl_destroy_inode,
>>>> -       .drop_inode     = generic_delete_inode,
>>>> +       .drop_inode     = ovl_drop_inode,
>>>> +       .evict_inode    = ovl_evict_inode,
>>>>       .put_super      = ovl_put_super,
>>>>       .sync_fs        = ovl_sync_fs,
>>>>       .statfs         = ovl_statfs,
>>>> @@ -1202,6 +1181,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>>>>       if (!ofs)
>>>>               goto out;
>>>> 
>>>> +       spin_lock_init(&ofs->upper_inodes_lock);
>>>> +       INIT_LIST_HEAD(&ofs->upper_inodes);
>>>> +
>>>>       ofs->creator_cred = cred = prepare_creds();
>>>>       if (!cred)
>>>>               goto out_err;
>>>> diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
>>>> new file mode 100644
>>>> index 0000000..e239b0a
>>>> --- /dev/null
>>>> +++ b/fs/overlayfs/sync.c
>>>> @@ -0,0 +1,256 @@
>>>> +/*
>>>> + * Copyright (C) 2018 Meili Inc.
>>>> + * Author Chengguang Xu <cgxu519@icloud.com>
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify it
>>>> + * under the terms of the GNU General Public License version 2 as published by
>>>> + * the Free Software Foundation.
>>>> + */
>>>> +
>>>> +#include <linux/fs.h>
>>>> +#include <linux/xattr.h>
>>>> +#include <linux/mount.h>
>>>> +#include <linux/writeback.h>
>>>> +#include <linux/blkdev.h>
>>>> +#include "overlayfs.h"
>>>> +
>>>> +void ovl_evict_inode(struct inode *inode)
>>>> +{
>>>> +       if (ovl_inode_upper(inode)) {
>>>> +               write_inode_now(ovl_inode_upper(inode), 1);
>>>> +               ovl_detach_upper_inodes_list(inode);
>>>> +
>>>> +               /*
>>>> +                * ovl_sync_inodes() may wait until
>>>> +                * flag OVL_EVICT_PENDING to be cleared.
>>>> +                */
>>>> +               spin_lock(&inode->i_lock);
>>>> +               ovl_clear_flag(OVL_EVICT_PENDING, inode);
>>>> +               /* See also atomic_bitops.txt */
>>>> +               smp_mb__after_atomic();
>>>> +               wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
>>>> +               spin_unlock(&inode->i_lock);
>>>> +               clear_inode(inode);
>>>> +       }
>>>> +}
>>>> +
>>>> +/**
>>>> + * ovl_sync_inodes
>>>> + * @sb: The overlayfs super block
>>>> + *
>>>> + * upper_inodes list is used for organizing ovl inode which has upper inode,
>>>> + * by iterating the list to looking for and syncing dirty upper inode.
>>>> + *
>>>> + * When starting syncing inode, we add the inode to wait list explicitly,
>>>> + * in order to save memory we reuse inode->i_wb_list and sb->s_inodes_wb,
>>>> + * so those fields have slightly different meanings in overlayfs.
>>>> + */
>>>> +static void ovl_sync_inodes(struct super_block *sb)
>>>> +{
>>>> +       struct ovl_fs *ofs = sb->s_fs_info;
>>>> +       struct ovl_inode *oi, *oi_prev;
>>>> +       struct inode *inode, *iput_inode = NULL;
>>>> +       struct inode *upper_inode;
>>>> +       struct blk_plug plug;
>>>> +
>>>> +       struct writeback_control wbc_for_sync = {
>>>> +               .sync_mode              = WB_SYNC_ALL,
>>>> +               .for_sync               = 1,
>>>> +               .range_start            = 0,
>>>> +               .range_end              = LLONG_MAX,
>>>> +               .nr_to_write            = LONG_MAX,
>>>> +       };
>>>> +
>>>> +       blk_start_plug(&plug);
>>>> +       spin_lock(&ofs->upper_inodes_lock);
>>>> +       list_for_each_entry(oi, &ofs->upper_inodes, upper_inodes_list) {
>>>> +               inode = &oi->vfs_inode;
>>>> +               spin_lock(&inode->i_lock);
>>>> +
>>>> +               if (inode->i_state & I_NEW) {
>>>> +                       spin_unlock(&inode->i_lock);
>>>> +                       continue;
>>>> +               }
>>>> +
>>>> +               /*
>>>> +                * If ovl indoe state is I_WILL_FREE or I_FREEING,
>>>> +                * sync operation will be done in the ovl_evict_inode(),
>>>> +                * so wait here until OVL_EVICT_PENDING flag to be cleared.
>>>> +                */
>>>> +               if (inode->i_state & (I_WILL_FREE | I_FREEING)) {
>>>> +                       oi_prev = list_entry(oi->upper_inodes_list.prev,
>>>> +                                       struct ovl_inode, upper_inodes_list);
>>>> +
>>>> +                       if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
>>>> +                               DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
>>>> +                               wait_queue_head_t *wqh;
>>>> +                               bool sleep;
>>>> +
>>>> +                               wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
>>>> +                               prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
>>>> +                               sleep = ovl_test_flag(OVL_EVICT_PENDING, inode);
>>>> +                               spin_unlock(&inode->i_lock);
>>>> +                               spin_unlock(&ofs->upper_inodes_lock);
>>>> +                               if (sleep)
>>>> +                                       schedule();
>>>> +                               finish_wait(wqh, &wait.wq_entry);
>>> 
>>> Please move all this to helper ovl_wait_evict_pending()
>>> 
>>> 
>>> 
>>>> +                               /* We need to do this as 'inode' can be freed by now */
>>> 
>>> Could oi_prev have been freed or removed from upper_inodes list?
>>> I don't see why not.
>>> You could try to continue iteration from iput_inode if it is not NULL,
>>> but what if it is NULL?
>> 
>> When iterating the list we either take a reference or wait until the item gets deleted,
>> so I think there is no chance that the previous inode becomes NULL.
>> 
> 
> Maybe I am missing something.
> When iterating the upper_inodes list you get a reference on current (oi)
> and maybe have a reference on previous via iput_inode.
> previous may have a reference from wb_list, but that reference can be lost
> while we are not holding the wblist lock. No?

The reference on the previous inode can be lost if the inode is not our target,
but that can only happen after iput(iput_inode); before that point it is safe.

When deleting an item from the list we take the lock to protect against concurrent
operations, so as long as we hold the list lock it is safe to touch items in the list.

> 
>> 
>>> Maybe instead of iput_inode = NULL when adding to inodes_wb list
>>> get another reference to inode and always store iput_inode = inode
>>> to use as prev in this case?
>> 
>> Maybe it’s better than the current code, but I think it makes the behavior more complex
>> to understand, so please let me wait for more comments on the approach.
>> 
> 
> Sure, let's wait for other comments, but I can't say I understand what is
> complex to understand about change that I propose. If anything it is simpler
> to understand, especially if you rename iput_inode to prev_inode:
> 
> +               /*
> +                * If ovl inode is not in sb->s_inodes_wb list, add to the
> +                * list and increment the inode reference until IO completes
> +                * on the inode.
> +                */
> +               spin_lock(&sb->s_inode_wblist_lock);
> +               if (list_empty(&inode->i_wb_list)) {
> +                       list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> +                       ihold(inode);
> +               }
> +               iput_inode = inode;


Actually, I forgot to check whether the previous item is the list head or not,
so I think there is a problem when deleting the very first item in the list;
even if we use iput_inode here, we can’t get rid of the list operation because of that.
If you think the code should be more readable, how about modifying it like below?

1. Take the sb->s_sync_lock mutex.
2. Splice sb->s_inodes_wb to a local temporary list.
3. Always operate on the first entry of that list.
4. Put entries back onto sb->s_inodes_wb one by one.

I don't think many concurrent syncfs calls run at the same time.
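The splice-and-drain pattern can be modeled in userspace C with minimal reimplementations of the <linux/list.h> helpers (the helper names mimic the kernel's, but this is only a sketch and `fake_inode` is an invented stand-in):

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
}
static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next; e->next->prev = e->prev;
	INIT_LIST_HEAD(e);
}
/* Move everything on src to the front of dst, reinitializing src. */
static void list_splice_init(struct list_head *src, struct list_head *dst)
{
	if (!list_empty(src)) {
		struct list_head *first = src->next, *last = src->prev;
		first->prev = dst;
		last->next = dst->next;
		dst->next->prev = last;
		dst->next = first;
		INIT_LIST_HEAD(src);
	}
}

struct fake_inode { int id; struct list_head i_wb_list; };

/* Steps 1-4 above: (with the sync lock held) splice the wait list to a
 * private list, then repeatedly take the first entry, "sync" it, and put
 * it back on the shared list.  Returns how many entries were synced. */
int splice_and_drain(struct fake_inode *inodes, int n)
{
	struct list_head s_inodes_wb, tmp;
	int synced = 0, i;

	INIT_LIST_HEAD(&s_inodes_wb);
	INIT_LIST_HEAD(&tmp);
	for (i = 0; i < n; i++) {
		inodes[i].id = i;
		list_add_tail(&inodes[i].i_wb_list, &s_inodes_wb);
	}

	list_splice_init(&s_inodes_wb, &tmp);		/* step 2 */
	while (!list_empty(&tmp)) {			/* step 3 */
		struct fake_inode *fi = (struct fake_inode *)
			((char *)tmp.next - offsetof(struct fake_inode, i_wb_list));
		list_del_init(&fi->i_wb_list);
		synced++;				/* "sync" fi here */
		list_add_tail(&fi->i_wb_list, &s_inodes_wb); /* step 4 */
	}
	return synced;
}
```

Because the drain always works on the first entry of a private list, there is no previous-item bookkeeping at all, which is what makes this variant easier to reason about.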


> 
>> 
>>> 
>>> It would be good if you wrote the "life cycle" of ovl_inode in a
>>> header comment to
>>> this file because it has become non-trivial to follow.
>>> An explicit mention of locking order is also very good to have in
>>> header comment.
>>> 
>>> 
>> 
>> Let me add in next version.
>> 
>> Talking about the life cycle of ovl_inode, currently I only mark OVL_EVICT_PENDING
>> for an ovl_inode which has an upper inode, is that acceptable?
>> 
> 
> Until your change, ovl_inode life cycle was 'generic' when refcount
> dropped to 0.
> Now, inode can be in OVL_EVICT_PENDING state, it can be waiting for sync
> on or off upper_inodes list. I'm just saying it would be nice to
> document all the
> possible states of an ovl_inode during eviction and document all the locking
> and refcount guarantees in one place to see that we did not miss anything.
> 
> The volume of this documentation is entirely up to you.
> I think it will be good for you as well to tell the whole story from
> start to end
> and verify there are no holes in the plot.


I did not mean that only the question above should go into the document; I’m just asking for opinions on the OVL_EVICT_PENDING flag.
Currently the state cycle of an ovl inode splits into two different cycles depending on whether it has an upper inode or not.

1. having upper inode
(allocation) -> new -> none -> OVL_EVICT_PENDING|I_FREEING -> I_FREEING|I_CLEAR -> (destruction)

2. no upper inode
(allocation) -> new -> none -> I_FREEING -> I_FREEING|I_CLEAR -> (destruction)

If you all think it’s harmless, then I’ll keep it as it is, because I think it’s helpful
for performance in some cases.

If there were misunderstandings, I’m sorry for the insufficient explanation in my previous reply.
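The two cycles listed above can be modeled as a tiny state machine. The states and the transition table below are an assumption read directly from the listing (the real kernel flags I_FREEING, I_CLEAR, and OVL_EVICT_PENDING are of course richer than this), and all function names are illustrative:

```c
#include <assert.h>

enum ovl_model_state {
	ST_NEW,
	ST_NONE,
	ST_EVICT_PENDING_FREEING,	/* OVL_EVICT_PENDING|I_FREEING */
	ST_FREEING,			/* I_FREEING (no upper inode)  */
	ST_FREEING_CLEAR,		/* I_FREEING|I_CLEAR           */
	ST_DESTROYED,
};

/* Returns the next state in the eviction cycle, or -1 if illegal. */
int model_next_state(int cur, int has_upper)
{
	switch (cur) {
	case ST_NEW:
		return ST_NONE;
	case ST_NONE:
		return has_upper ? ST_EVICT_PENDING_FREEING : ST_FREEING;
	case ST_EVICT_PENDING_FREEING:
	case ST_FREEING:
		return ST_FREEING_CLEAR;
	case ST_FREEING_CLEAR:
		return ST_DESTROYED;
	default:
		return -1;
	}
}

/* Walk a freshly allocated inode all the way to destruction and count
 * the transitions; both cycles should take the same number of steps. */
int model_run_cycle(int has_upper)
{
	int st = ST_NEW, steps = 0;

	while (st != ST_DESTROYED) {
		st = model_next_state(st, has_upper);
		assert(st != -1);
		steps++;
	}
	return steps;
}
```

The only fork is at ST_NONE, which is exactly the "having upper inode or not" distinction above; everything after that point is common.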



Thanks,
Chengguang.


* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-22 16:00       ` Chengguang Xu
@ 2018-01-22 18:45         ` Amir Goldstein
  2018-01-23  0:24           ` Chengguang Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2018-01-22 18:45 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Miklos Szeredi, Jan Kara, overlayfs

On Mon, Jan 22, 2018 at 6:00 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>
>> 在 2018年1月22日,下午8:58,Amir Goldstein <amir73il@gmail.com> 写道:
>>
>> On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>
>>>> 在 2018年1月22日,下午7:27,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>
>>>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
[...]
>>>>
>>>>
>>>>
>>>>> +                               /* We need to do this as 'inode' can be freed by now */
>>>>
>>>> Could oi_prev have been freed or removed from upper_inodes list?
>>>> I don't see why not.
>>>> You could try to continue iteration from iput_inode if it is not NULL,
>>>> but what if it is NULL?
>>>
>>> When iterating the list we either take a reference or wait until the item gets deleted,
>>> so I think there is no chance that the previous inode becomes NULL.
>>>
>>
>> Maybe I am missing something.
>> When iterating the upper_inodes list you get a reference on current (oi)
>> and maybe have a reference on previous via iput_inode.
>> previous may have a reference from wb_list, but that reference can be lost
>> while we are not holding the wblist lock. No?
>
> The reference on the previous inode can be lost if the inode is not our target,
> but that can only happen after iput(iput_inode); before that point it is safe.
>

I was talking about the case where iput_inode is NULL.

> When deleting an item from the list we take the lock to protect against concurrent
> operations, so as long as we hold the list lock it is safe to touch items in the list.
>

But we have let go of the list lock.

>>
>>>
>>>> Maybe instead of iput_inode = NULL when adding to inodes_wb list
>>>> get another reference to inode and always store iput_inode = inode
>>>> to use as prev in this case?
>>>
>>> Maybe it’s better than the current code, but I think it makes the behavior more complex
>>> to understand, so please let me wait for more comments on the approach.
>>>
>>
>> Sure, let's wait for other comments, but I can't say I understand what is
>> complex about the change I propose. If anything it is simpler
>> to understand, especially if you rename iput_inode to prev_inode:
>>
>> +               /*
>> +                * If ovl inode is not in sb->s_inodes_wb list, add to the
>> +                * list and increment the inode reference until IO completes
>> +                * on the inode.
>> +                */
>> +               spin_lock(&sb->s_inode_wblist_lock);
>> +               if (list_empty(&inode->i_wb_list)) {
>> +                       list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
>> +                       ihold(inode);
>> +               }
>> +               iput_inode = inode;
>
>
> Actually, I forgot to check whether the previous item is the list head or not,
> so I think there is a problem when deleting the very first item in the list;
> even if we use iput_inode here, we can’t get rid of the list operation because of that.
> If you think the code should be more readable, how about modifying it like below?
>
> 1. Take the sb->s_sync_lock mutex.
> 2. Splice sb->s_inodes_wb to a local temporary list.
> 3. Always operate on the first entry of that list.
> 4. Put entries back onto sb->s_inodes_wb one by one.
>
> I don't think many concurrent syncfs calls run at the same time.

This could work. Need to see it.

>
>
>>
>>>
>>>>
>>>> It would be good if you wrote the "life cycle" of ovl_inode in a
>>>> header comment to
>>>> this file because it has become non-trivial to follow.
>>>> An explicit mention of locking order is also very good to have in
>>>> header comment.
>>>>
>>>>
>>>
>>> Let me add in next version.
>>>
>>> Talking about the life cycle of ovl_inode, currently I only mark OVL_EVICT_PENDING
>>> for an ovl_inode which has an upper inode, is that acceptable?
>>>
>>
>> Until your change, ovl_inode life cycle was 'generic' when refcount
>> dropped to 0.
>> Now, inode can be in OVL_EVICT_PENDING state, it can be waiting for sync
>> on or off upper_inodes list. I'm just saying it would be nice to
>> document all the
>> possible states of an ovl_inode during eviction and document all the locking
>> and refcount guarantees in one place to see that we did not miss anything.
>>
>> The volume of this documentation is entirely up to you.
>> I think it will be good for you as well to tell the whole story from
>> start to end
>> and verify there are no holes in the plot.
>
>
> I did not mean that only the question above should go into the document; I’m just asking for opinions on the OVL_EVICT_PENDING flag.

If you meant the flag name or using a flag in general, that seems good to me.

> Currently the state cycle of an ovl inode splits into two different cycles depending on whether it has an upper inode or not.
>
> 1. having upper inode
> (allocation) -> new -> none -> OVL_EVICT_PENDING|I_FREEING -> I_FREEING|I_CLEAR -> (destruction)
>

That's good. If there are other conditions that guarantee the state
transitions are not racy, such as the list locks, the concurrent sync lock,
and an elevated refcount, you should mention them.

Cheers,
Amir.


* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-22 18:45         ` Amir Goldstein
@ 2018-01-23  0:24           ` Chengguang Xu
  2018-01-23  6:39             ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Chengguang Xu @ 2018-01-23  0:24 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Jan Kara, overlayfs

> 
> 在 2018年1月23日,上午2:45,Amir Goldstein <amir73il@gmail.com> 写道:
> 
> On Mon, Jan 22, 2018 at 6:00 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>> 
>>> 在 2018年1月22日,下午8:58,Amir Goldstein <amir73il@gmail.com> 写道:
>>> 
>>> On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>> 
>>>>> 在 2018年1月22日,下午7:27,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>> 
>>>>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
> [...]
>>>>> 
>>>>> 
>>>>> 
>>>>>> +                               /* We need to do this as 'inode' can be freed by now */
>>>>> 
>>>>> Could oi_prev have been freed or removed from upper_inodes list?
>>>>> I don't see why not.
>>>>> You could try to continue iteration from iput_inode if it is not NULL,
>>>>> but what if it is NULL?
>>>> 
>>>> When iterating the list we either get reference or wait until the item gets deleted,
>>>> so I think there is no chance let previous inode becomes NULL.
>>>> 
>>> 
>>> Maybe I am missing something.
>>> When iterating the upper_inodes list you get a reference on current (oi)
>>> and maybe have a reference on previous via iput_inode.
>>> previous may have a reference from wb_list, but that reference can be lost
>>> while we are not holding the wblist lock. No?
>> 
>> The reference of previous inode can be lost if the inode is not our target,
>> but it must be happened after iput(iput_inode), before this will be safe.
>> 
> 
> I was talking about the case where iput_inode is NULL.

If iput_inode is NULL, the reference on the previous inode is not put until the
wait finishes, so this case is even safer than when iput_inode has a value.
Meanwhile, as long as an inode holds a reference, it won't get deleted from the list.


> 
>> When deleting item from list we take lock to protect from concurrent operations,
>> so if we have already got list lock, it will be safe to touch items in the list.
>> 
> 
> But we have let go of the list lock.

I don't understand what you are worried about; could you be more specific?

Thanks,
Chengguang.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-23  0:24           ` Chengguang Xu
@ 2018-01-23  6:39             ` Amir Goldstein
  2018-01-23  8:12               ` Chengguang Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2018-01-23  6:39 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Miklos Szeredi, Jan Kara, overlayfs

On Tue, Jan 23, 2018 at 2:24 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>
>> 在 2018年1月23日,上午2:45,Amir Goldstein <amir73il@gmail.com> 写道:
>>
>> On Mon, Jan 22, 2018 at 6:00 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>
>>>> 在 2018年1月22日,下午8:58,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>
>>>> On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>>>
>>>>>> 在 2018年1月22日,下午7:27,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>>>
>>>>>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>>>> [...]
>> [...]
>>>>>>
>>>>>>
>>>>>>
>>>>>>> +                               /* We need to do this as 'inode' can be freed by now */
>>>>>>
>>>>>> Could oi_prev have been freed or removed from upper_inodes list?
>>>>>> I don't see why not.
>>>>>> You could try to continue iteration from iput_inode if it is not NULL,
>>>>>> but what if it is NULL?
>>>>>
>>>>> When iterating the list we either get reference or wait until the item gets deleted,
>>>>> so I think there is no chance let previous inode becomes NULL.
>>>>>
>>>>
>>>> Maybe I am missing something.
>>>> When iterating the upper_inodes list you get a reference on current (oi)
>>>> and maybe have a reference on previous via iput_inode.
>>>> previous may have a reference from wb_list, but that reference can be lost
>>>> while we are not holding the wblist lock. No?
>>>
>>> The reference of previous inode can be lost if the inode is not our target,
>>> but it must be happened after iput(iput_inode), before this will be safe.
>>>
>>
>> I was talking about the case where iput_inode is NULL.
>
> If iput_inode is NULL previous inode will not put reference until finishing wait,
> so this case is safer than iput_inode having value. Meanwhile if inode has reference,
> it won’t get deleted from list.
>

Everything you write may be true, but when writing "list" and "lock" without
specifying which "list" and which "lock", the explanation is not clear.

If an inode has a reference it won't get deleted from the upper_inodes list,
but it can be deleted from the s_inodes_wb list once the wait for IO is
completed, and then lose the last reference and get deleted from the
upper_inodes list.

As far as I can tell, ovl_sync_inodes() and ovl_wait_inodes() are not
synchronized by any exclusive lock, so when you let go of all the
ovl_wait_inodes() locks for finish_wait(), you have no guarantee on any
of the inodes on upper_inodes anymore, and specifically no guarantee
that oi_prev is alive.

As far as I can tell, your suggestion to synchronize both ovl_sync_inodes()
and ovl_wait_inodes() with sb->s_sync_lock will take care of that problem.
Since in the current implementation both functions are called with wait == 1
anyway, I don't see any problem with synchronizing the two functions. Do you?

In any case, if you do agree with my analysis and add sb->s_sync_lock
to ovl_wait_inodes(), please document the reasons that make the upper_inodes
iteration in ovl_wait_inodes() safe, because I don't think they are as
trivial as you make them sound.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-23  6:39             ` Amir Goldstein
@ 2018-01-23  8:12               ` Chengguang Xu
  2018-01-23  9:00                 ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Chengguang Xu @ 2018-01-23  8:12 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Jan Kara, overlayfs

> 
> 在 2018年1月23日,下午2:39,Amir Goldstein <amir73il@gmail.com> 写道:
> 
> On Tue, Jan 23, 2018 at 2:24 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>> 
>>> 在 2018年1月23日,上午2:45,Amir Goldstein <amir73il@gmail.com> 写道:
>>> 
>>> On Mon, Jan 22, 2018 at 6:00 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>> 
>>>>> 在 2018年1月22日,下午8:58,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>> 
>>>>> On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>>>> 
>>>>>>> 在 2018年1月22日,下午7:27,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>>>> 
>>>>>>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>>>>> [...]
>>> [...]
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> +                               /* We need to do this as 'inode' can be freed by now */
>>>>>>> 
>>>>>>> Could oi_prev have been freed or removed from upper_inodes list?
>>>>>>> I don't see why not.
>>>>>>> You could try to continue iteration from iput_inode if it is not NULL,
>>>>>>> but what if it is NULL?
>>>>>> 
>>>>>> When iterating the list we either get reference or wait until the item gets deleted,
>>>>>> so I think there is no chance let previous inode becomes NULL.
>>>>>> 
>>>>> 
>>>>> Maybe I am missing something.
>>>>> When iterating the upper_inodes list you get a reference on current (oi)
>>>>> and maybe have a reference on previous via iput_inode.
>>>>> previous may have a reference from wb_list, but that reference can be lost
>>>>> while we are not holding the wblist lock. No?
>>>> 
>>>> The reference of previous inode can be lost if the inode is not our target,
>>>> but it must be happened after iput(iput_inode), before this will be safe.
>>>> 
>>> 
>>> I was talking about the case where iput_inode is NULL.
>> 
>> If iput_inode is NULL previous inode will not put reference until finishing wait,
>> so this case is safer than iput_inode having value. Meanwhile if inode has reference,
>> it won’t get deleted from list.
>> 
> 
> Everything you write may be true, but when writing "list" and "lock" and
> not specifying which "list" and which "lock" the explanation is not clear.
> 
> If inode has reference it won't get deleted from upper_inodes list,
> but it can be deleted from  s_inodes_wb list once wait for io is completed
> and then loose the last reference and get deleted from upper_inodes list.
> 
> As far as I can tell, ovl_sync_inodes() and ovl_wait_inodes() are not
> synchronized by any exclusive lock, so when you let go of all the
> ovl_wait_inodes() locks for finish_wait(), you have no guaranty on any
> of the inodes on upper_inodes anymore, and specifically no guaranty
> that oi_prev is alive.

I'm guessing you are worried about the case where one syncfs(A) is waiting
for IO to complete on inode 100 while, at the same time, another syncfs(B)
is iterating inode 200, which is next after inode 100 in the upper_inodes
list. Right?

In this case, the second syncfs(B) will take another refcount on inode 100
in ovl_sync_inodes(), so even after decreasing the refcount of inode 100 in
ovl_wait_inodes() for syncfs(A), inode 100 is still safe to use in
ovl_sync_inodes() for syncfs(B).


> 
> As far as I can tell, your suggestion to synchronize both ovl_sync_inodes()
> and ovl_wait_inodes() with sb->s_sync_lock will take care of that problem.
> Since in current implementation both functions are called with wait == 1
> anyway, I don't see any problem with synchronizing the two functions. do you?
> 

I did some basic tests (heavy untar plus concurrent syncfs) to verify my
proposal, and so far there seems to be no problem. But I want to say my
proposal is not mainly for addressing the synchronization problem between
ovl_sync_inodes() and ovl_wait_inodes(); I just want the code to look simple
and easy to understand.


> In any case, if you do agree with my analysis and add sb->s_sync_lock
> to ovl_wait_inodes(), please document the reasons that make the upper_inode
> in ovl_wait_inodes() iteration safe, because I don't think they are as
> trivial as
> you make them sound.

I think if we switch to my proposal, the code will be easy enough to
understand, so there will not be much doubt about the safety of the list
iteration.


Thanks,
Chengguang.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v6] ovl: Improving syncfs efficiency
  2018-01-23  8:12               ` Chengguang Xu
@ 2018-01-23  9:00                 ` Amir Goldstein
  0 siblings, 0 replies; 10+ messages in thread
From: Amir Goldstein @ 2018-01-23  9:00 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Miklos Szeredi, Jan Kara, overlayfs

On Tue, Jan 23, 2018 at 10:12 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>
>> 在 2018年1月23日,下午2:39,Amir Goldstein <amir73il@gmail.com> 写道:
>>
>> On Tue, Jan 23, 2018 at 2:24 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>
>>>> 在 2018年1月23日,上午2:45,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>
>>>> On Mon, Jan 22, 2018 at 6:00 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>>
>>>>>> 在 2018年1月22日,下午8:58,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>>>
>>>>>> On Mon, Jan 22, 2018 at 2:05 PM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>>>>>
>>>>>>>> 在 2018年1月22日,下午7:27,Amir Goldstein <amir73il@gmail.com> 写道:
>>>>>>>>
>>>>>>>> On Mon, Jan 22, 2018 at 11:47 AM, Chengguang Xu <cgxu519@icloud.com> wrote:
>>>>>>>>> [...]
>>>> [...]
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> +                               /* We need to do this as 'inode' can be freed by now */
>>>>>>>>
>>>>>>>> Could oi_prev have been freed or removed from upper_inodes list?
>>>>>>>> I don't see why not.
>>>>>>>> You could try to continue iteration from iput_inode if it is not NULL,
>>>>>>>> but what if it is NULL?
>>>>>>>
>>>>>>> When iterating the list we either get reference or wait until the item gets deleted,
>>>>>>> so I think there is no chance let previous inode becomes NULL.
>>>>>>>
>>>>>>
>>>>>> Maybe I am missing something.
>>>>>> When iterating the upper_inodes list you get a reference on current (oi)
>>>>>> and maybe have a reference on previous via iput_inode.
>>>>>> previous may have a reference from wb_list, but that reference can be lost
>>>>>> while we are not holding the wblist lock. No?
>>>>>
>>>>> The reference of previous inode can be lost if the inode is not our target,
>>>>> but it must be happened after iput(iput_inode), before this will be safe.
>>>>>
>>>>
>>>> I was talking about the case where iput_inode is NULL.
>>>
>>> If iput_inode is NULL previous inode will not put reference until finishing wait,
>>> so this case is safer than iput_inode having value. Meanwhile if inode has reference,
>>> it won’t get deleted from list.
>>>
>>
>> Everything you write may be true, but when writing "list" and "lock" and
>> not specifying which "list" and which "lock" the explanation is not clear.
>>
>> If inode has reference it won't get deleted from upper_inodes list,
>> but it can be deleted from  s_inodes_wb list once wait for io is completed
>> and then loose the last reference and get deleted from upper_inodes list.
>>
>> As far as I can tell, ovl_sync_inodes() and ovl_wait_inodes() are not
>> synchronized by any exclusive lock, so when you let go of all the
>> ovl_wait_inodes() locks for finish_wait(), you have no guaranty on any
>> of the inodes on upper_inodes anymore, and specifically no guaranty
>> that oi_prev is alive.
>
> I’m guessing you must be worrying about one syncfs(A) is waiting the IO to
> be completed on inode 100, at the same time another syncfs(B) is iterating
> inode 200 which is next of inode 100 in the upper_inodes list. Right?

Not exactly. inode 100 can be put on the wb_list by syncfs(B) without holding
a reference in iput_inode. Then syncfs(C) can start, see that inode 100 is
already on the wb_list, and not see inode 200 on the upper list at all.
syncfs(C) will call ovl_wait_inodes(), which will drop the reference on
inode 100, and then eviction of inode 100 can start and complete while
syncfs(B) is still waiting on EVICT_PENDING of inode 200. When the wait for
inode 200 is finished, inode 100 is already gone.
This is a complex situation and I can't really say whether it is possible or
whether something will prevent it, but the best way to avoid it is to have
clear semantics, so we won't need to have this discussion at all.

>
> In this case, second syncfs(B) will get another refcount of inode 100 in
> ovl_sync_inodes() so even ater decreasing refcount of inode 100 in
> ovl_wait_inodes() for syncfs(A), inode 100 is still safe to use in
> the ovl_sync_inodes() for syncfs(B).
>
>
>>
>> As far as I can tell, your suggestion to synchronize both ovl_sync_inodes()
>> and ovl_wait_inodes() with sb->s_sync_lock will take care of that problem.
>> Since in current implementation both functions are called with wait == 1
>> anyway, I don't see any problem with synchronizing the two functions. do you?
>>
>
> I did some basic tests(heavy untar && concurrent syncfs) to verify my proposal,
> and up to now seems there is no problem, but I want to say my proposal is
> not mainly for addressing synchronizing problem between ovl_sync_inodes()
> and ovl_wait_inodes(), I just want the code looks simple and easy to understand.
>
>
>> In any case, if you do agree with my analysis and add sb->s_sync_lock
>> to ovl_wait_inodes(), please document the reasons that make the upper_inode
>> in ovl_wait_inodes() iteration safe, because I don't think they are as
>> trivial as
>> you make them sound.
>
> I think if we modify to my proposal, then the code will be easy enough to understand,
> so there will be no much doubt for safety of list iterating.
>

I'm not sure if you are new to kernel development, but understandable code
is no less important than correct code. You are not really writing for the
compiler; you are writing for reviewers and future developers who can make
subtle changes to your code. If the semantics are not perfectly clear, then
even if your code is correct now, it could easily be broken by a future
developer.

So yes, your proposal sounds good in that regard.

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-01-23  9:00 UTC | newest]

Thread overview: 10+ messages
2018-01-22  9:47 [PATCH v6] ovl: Improving syncfs efficiency Chengguang Xu
2018-01-22 11:27 ` Amir Goldstein
2018-01-22 12:05   ` Chengguang Xu
2018-01-22 12:58     ` Amir Goldstein
2018-01-22 16:00       ` Chengguang Xu
2018-01-22 18:45         ` Amir Goldstein
2018-01-23  0:24           ` Chengguang Xu
2018-01-23  6:39             ` Amir Goldstein
2018-01-23  8:12               ` Chengguang Xu
2018-01-23  9:00                 ` Amir Goldstein
