linux-unionfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v11] ovl: Improving syncfs efficiency
@ 2020-02-10  3:10 Chengguang Xu
  2020-03-17  4:41 ` Re: [PATCH " Chengguang Xu
  2020-04-15 19:19 ` [PATCH " Miklos Szeredi
  0 siblings, 2 replies; 14+ messages in thread
From: Chengguang Xu @ 2020-02-10  3:10 UTC (permalink / raw)
  To: miklos, amir73il, jack; +Cc: linux-unionfs, Chengguang Xu

The current syncfs(2) syscall on overlayfs just calls sync_filesystem()
on upper_sb to synchronize all dirty inodes in the upper filesystem,
regardless of which overlay instance owns each inode. In container use
cases, when multiple containers share the same underlying upper
filesystem, this has the following shortcomings.

(1) Performance
Synchronization can be needlessly heavy because it also syncs inodes
that do not belong to the target overlayfs.

(2) Interference
Unplanned synchronization can hurt the IO performance of unrelated
container processes on other overlayfs instances.

This patch iterates the upper inode list in overlayfs to sync only the
target dirty inodes and wait for completion. This reduces the cost of
synchronization and avoids seriously impacting the IO performance of
unrelated processes. In the special case of an overlayfs with very few
dirty inodes but a large number of clean upper inodes, the iteration may
waste time and make synchronization slower than before, but we think the
potential for performance improvement far surpasses the potential for
performance regression in most cases.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
Changes since v10:
- Add special handling in ovl_evict_inode() for inode eviction triggered
by kswapd memory reclamation (with PF_MEMALLOC flag).
- Rebase on current latest overlayfs-next tree.
- Slightly modify license information.

Changes since v9:
- Rebase on current latest overlayfs-next tree.
- Calling clear_inode() regardless of having upper inode.

Changes since v8:
- Remove unnecessary blk_start_plug() call in ovl_sync_inodes().

Changes since v7:
- Check OVL_EVICT_PENDING flag instead of I_FREEING to recognize
inode which is in eviction process.
- Do not move ovl_inode back to ofs->upper_inodes when inode is in
eviction process.
- Delete unnecessary memory barrier in ovl_evict_inode().

Changes since v6:
- Take/release sb->s_sync_lock before/after sync_fs to serialize
concurrent syncfs.
- Change list iterating method to improve code readability.
- Fix typo in commit log and comments.
- Add header comment to sync.c.

Changes since v5:
- Move sync related functions to new file sync.c.
- Introduce OVL_EVICT_PENDING flag to avoid race conditions.
- If ovl inode state is I_WILL_FREE or I_FREEING then syncfs
will wait until OVL_EVICT_PENDING is cleared.
- Move inode sync operation into evict_inode from drop_inode.
- Call upper_sb->sync_fs with no-wait after sync dirty upper
inodes.
- Hold s_inode_wblist_lock until deleting item from list.
- Remove write_inode function because it is no longer used.

Changes since v4:
- Add syncing operation and deleting from upper_inodes list
during ovl inode destruction.
- Export symbol of inode_wait_for_writeback() for waiting
writeback on ovl inode.
- Reuse I_SYNC flag to avoid race conditions between syncfs
and drop_inode.

Changes since v3:
- Introduce upper_inode list to organize ovl inode which has upper
  inode.
- Avoid iput operation inside spin lock.
- Change list iteration method for safety.
- Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
- Modify commit log and add more comments to the code.

Changes since v2:
- Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
overlayfs.
- Factoring out sync operation to different helpers for easy
  understanding.

Changes since v1:
- If upper filesystem is readonly mode then skip synchronization.
- Introduce global wait list to replace temporary wait list for
concurrent synchronization.

 fs/overlayfs/Makefile    |   2 +-
 fs/overlayfs/overlayfs.h |   9 ++
 fs/overlayfs/ovl_entry.h |   5 +
 fs/overlayfs/super.c     |  44 ++----
 fs/overlayfs/sync.c      | 331 +++++++++++++++++++++++++++++++++++++++
 fs/overlayfs/util.c      |  23 ++-
 6 files changed, 381 insertions(+), 33 deletions(-)
 create mode 100644 fs/overlayfs/sync.c

diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
index 9164c585eb2f..0c4e53fcf4ef 100644
--- a/fs/overlayfs/Makefile
+++ b/fs/overlayfs/Makefile
@@ -6,4 +6,4 @@
 obj-$(CONFIG_OVERLAY_FS) += overlay.o
 
 overlay-objs := super.o namei.o util.o inode.o file.o dir.o readdir.o \
-		copy_up.o export.o
+		copy_up.o export.o sync.o
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 3623d28aa4fa..3cdb06e206a3 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -40,6 +40,10 @@ enum ovl_inode_flag {
 	OVL_UPPERDATA,
 	/* Inode number will remain constant over copy up. */
 	OVL_CONST_INO,
+	/* Set when ovl inode is about to be freed */
+	OVL_EVICT_PENDING,
+	/* Set when sync upper inode in workqueue work */
+	OVL_WRITE_INODE_PENDING,
 };
 
 enum ovl_entry_flag {
@@ -295,6 +299,7 @@ bool ovl_is_metacopy_dentry(struct dentry *dentry);
 char *ovl_get_redirect_xattr(struct dentry *dentry, int padding);
 ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 		     size_t padding);
+void ovl_detach_upper_inodes_list(struct inode *inode);
 
 static inline bool ovl_is_impuredir(struct dentry *dentry)
 {
@@ -466,3 +471,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 
 /* export.c */
 extern const struct export_operations ovl_export_operations;
+
+/* sync.c */
+void ovl_evict_inode(struct inode *inode);
+int ovl_sync_fs(struct super_block *sb, int wait);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 89015ea822e7..42e092e8cd61 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -75,6 +75,9 @@ struct ovl_fs {
 	struct inode *indexdir_trap;
 	/* -1: disabled, 0: same fs, 1..32: number of unused ino bits */
 	int xino_mode;
+	/* Upper inode list and lock */
+	spinlock_t upper_inodes_lock;
+	struct list_head upper_inodes;
 };
 
 static inline struct ovl_fs *OVL_FS(struct super_block *sb)
@@ -115,6 +118,8 @@ struct ovl_inode {
 
 	/* synchronize copy up and more */
 	struct mutex lock;
+	/* Upper inodes list */
+	struct list_head upper_inodes_list;
 };
 
 static inline struct ovl_inode *OVL_I(struct inode *inode)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 319fe0d355b0..3e6871e1a7ba 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -183,6 +183,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
 	oi->lower = NULL;
 	oi->lowerdata = NULL;
 	mutex_init(&oi->lock);
+	INIT_LIST_HEAD(&oi->upper_inodes_list);
 
 	return &oi->vfs_inode;
 }
@@ -249,36 +250,6 @@ static void ovl_put_super(struct super_block *sb)
 	ovl_free_fs(ofs);
 }
 
-/* Sync real dirty inodes in upper filesystem (if it exists) */
-static int ovl_sync_fs(struct super_block *sb, int wait)
-{
-	struct ovl_fs *ofs = sb->s_fs_info;
-	struct super_block *upper_sb;
-	int ret;
-
-	if (!ofs->upper_mnt)
-		return 0;
-
-	/*
-	 * If this is a sync(2) call or an emergency sync, all the super blocks
-	 * will be iterated, including upper_sb, so no need to do anything.
-	 *
-	 * If this is a syncfs(2) call, then we do need to call
-	 * sync_filesystem() on upper_sb, but enough if we do it when being
-	 * called with wait == 1.
-	 */
-	if (!wait)
-		return 0;
-
-	upper_sb = ofs->upper_mnt->mnt_sb;
-
-	down_read(&upper_sb->s_umount);
-	ret = sync_filesystem(upper_sb);
-	up_read(&upper_sb->s_umount);
-
-	return ret;
-}
-
 /**
  * ovl_statfs
  * @sb: The overlayfs super block
@@ -376,11 +347,19 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 	return 0;
 }
 
+static int ovl_drop_inode(struct inode *inode)
+{
+	if (ovl_inode_upper(inode))
+		ovl_set_flag(OVL_EVICT_PENDING, inode);
+	return 1;
+}
+
 static const struct super_operations ovl_super_operations = {
 	.alloc_inode	= ovl_alloc_inode,
 	.free_inode	= ovl_free_inode,
 	.destroy_inode	= ovl_destroy_inode,
-	.drop_inode	= generic_delete_inode,
+	.drop_inode	= ovl_drop_inode,
+	.evict_inode	= ovl_evict_inode,
 	.put_super	= ovl_put_super,
 	.sync_fs	= ovl_sync_fs,
 	.statfs		= ovl_statfs,
@@ -1601,6 +1580,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	if (!ofs)
 		goto out;
 
+	spin_lock_init(&ofs->upper_inodes_lock);
+	INIT_LIST_HEAD(&ofs->upper_inodes);
+
 	ofs->creator_cred = cred = prepare_creds();
 	if (!cred)
 		goto out_err;
diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
new file mode 100644
index 000000000000..aecd312ec851
--- /dev/null
+++ b/fs/overlayfs/sync.c
@@ -0,0 +1,331 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 All Rights Reserved.
+ * Author Chengguang Xu <cgxu519@mykernel.net>
+ */
+
+#include <linux/fs.h>
+#include <linux/xattr.h>
+#include <linux/mount.h>
+#include <linux/writeback.h>
+#include <linux/blkdev.h>
+#include "overlayfs.h"
+
+/**
+ * The upper_inodes list organizes the potential targets of syncfs:
+ * an ovl inode which has an upper inode is added to this list on
+ * initialization or update and is deleted from this list on
+ * eviction.
+ *
+ * The ovl_inode flag "OVL_EVICT_PENDING" indicates that the ovl inode
+ * is in the eviction process; syncfs (actually ovl_sync_inodes()) will
+ * wait on an evicting inode until the IO is completed in evict_inode().
+ *
+ * The inode state/ovl_inode flag cycle in syncfs context is as below.
+ * OVL_EVICT_PENDING (ovl_inode->flags) is only set when the inode has
+ * an upper inode.
+ *
+ * (allocation)
+ *   |
+ * I_NEW (inode->i_state)
+ *   |
+ * NONE  (inode->i_state)
+ *   |
+ * OVL_EVICT_PENDING (ovl_inode->flags)
+ *   |
+ * I_FREEING (inode->i_state) | OVL_EVICT_PENDING (ovl_inode->flags)
+ *   |
+ * I_FREEING (inode->i_state) | I_CLEAR (inode->i_state)
+ *   |
+ * (destruction)
+ *
+ *
+ * There are five locks in syncfs contexts:
+ *
+ * upper_sb->s_umount(semaphore)    : avoiding r/o to r/w or vice versa
+ * sb->s_sync_lock(mutex)           : avoiding concurrent syncfs running
+ * ofs->upper_inodes_lock(spinlock) : protecting upper_inodes list
+ * sb->s_inode_wblist_lock(spinlock): protecting s_inodes_wb(sync waiting) list
+ * inode->i_lock(spinlock)          : protecting inode fields
+ *
+ * Lock dependencies in syncfs contexts:
+ *
+ * upper_sb->s_umount
+ *	sb->s_sync_lock
+ *		ofs->upper_inodes_lock
+ *			inode->i_lock
+ *
+ * upper_sb->s_umount
+ *	sb->s_sync_lock
+ *		sb->s_inode_wblist_lock
+ *
+ */
+
+struct ovl_write_inode_work {
+	struct work_struct work;
+	struct inode *inode;
+};
+
+static void ovl_write_inode_work_fn(struct work_struct *work)
+{
+	struct ovl_write_inode_work *ovl_wiw;
+	struct inode *inode;
+
+	ovl_wiw = container_of(work, struct ovl_write_inode_work, work);
+	inode = ovl_wiw->inode;
+	write_inode_now(ovl_inode_upper(inode), 1);
+
+	spin_lock(&inode->i_lock);
+	ovl_clear_flag(OVL_WRITE_INODE_PENDING, inode);
+	wake_up_bit(&OVL_I(inode)->flags, OVL_WRITE_INODE_PENDING);
+	spin_unlock(&inode->i_lock);
+}
+
+void ovl_evict_inode(struct inode *inode)
+{
+	struct ovl_inode *oi = OVL_I(inode);
+	struct ovl_write_inode_work ovl_wiw;
+	DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
+	wait_queue_head_t *wqh;
+
+	if (ovl_inode_upper(inode)) {
+		if (current->flags & PF_MEMALLOC) {
+			spin_lock(&inode->i_lock);
+			ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
+			wqh = bit_waitqueue(&oi->flags,
+					OVL_WRITE_INODE_PENDING);
+			prepare_to_wait(wqh, &wait.wq_entry,
+					TASK_UNINTERRUPTIBLE);
+			spin_unlock(&inode->i_lock);
+
+			ovl_wiw.inode = inode;
+			INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
+			schedule_work(&ovl_wiw.work);
+
+			schedule();
+			finish_wait(wqh, &wait.wq_entry);
+		} else {
+			write_inode_now(ovl_inode_upper(inode), 1);
+		}
+		ovl_detach_upper_inodes_list(inode);
+
+		/*
+		 * ovl_sync_inodes() may wait until the
+		 * OVL_EVICT_PENDING flag is cleared.
+		 */
+		spin_lock(&inode->i_lock);
+		ovl_clear_flag(OVL_EVICT_PENDING, inode);
+		wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
+		spin_unlock(&inode->i_lock);
+	}
+	clear_inode(inode);
+}
+
+void ovl_wait_evict_pending(struct inode *inode)
+{
+	struct ovl_fs *ofs = inode->i_sb->s_fs_info;
+	struct ovl_inode *oi = OVL_I(inode);
+	DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
+	wait_queue_head_t *wqh;
+
+	wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
+	prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&ofs->upper_inodes_lock);
+	schedule();
+	finish_wait(wqh, &wait.wq_entry);
+}
+
+/**
+ * ovl_sync_inodes
+ * @sb: The overlayfs super block
+ *
+ * The upper_inodes list organizes ovl inodes which have an upper inode;
+ * we iterate the list to look for and sync dirty upper inodes.
+ *
+ * When we start syncing an inode, we add it to the wait list explicitly;
+ * to save memory we reuse inode->i_wb_list and sb->s_inodes_wb,
+ * so those fields have slightly different meanings in overlayfs.
+ */
+static void ovl_sync_inodes(struct super_block *sb)
+{
+	struct ovl_fs *ofs = sb->s_fs_info;
+	struct ovl_inode *oi;
+	struct inode *inode;
+	struct inode *upper_inode;
+	struct blk_plug plug;
+	LIST_HEAD(sync_tmp_list);
+
+	struct writeback_control wbc_for_sync = {
+		.sync_mode		= WB_SYNC_ALL,
+		.for_sync		= 1,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+		.nr_to_write		= LONG_MAX,
+	};
+
+	blk_start_plug(&plug);
+	spin_lock(&ofs->upper_inodes_lock);
+	list_splice_init(&ofs->upper_inodes, &sync_tmp_list);
+
+	while (!list_empty(&sync_tmp_list)) {
+		oi = list_first_entry(&sync_tmp_list, struct ovl_inode,
+						upper_inodes_list);
+		inode = &oi->vfs_inode;
+		spin_lock(&inode->i_lock);
+
+		if (inode->i_state & I_NEW) {
+			list_move_tail(&oi->upper_inodes_list,
+					&ofs->upper_inodes);
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+
+		/*
+		 * If the OVL_EVICT_PENDING flag is set,
+		 * leave the sync operation to ovl_evict_inode(),
+		 * so wait here until the flag is cleared.
+		 */
+		if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
+			ovl_wait_evict_pending(inode);
+			goto next;
+		}
+
+		list_move_tail(&oi->upper_inodes_list,
+				&ofs->upper_inodes);
+		ihold(inode);
+		upper_inode = ovl_inode_upper(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&ofs->upper_inodes_lock);
+
+		if (!(upper_inode->i_state & I_DIRTY_ALL)) {
+			if (!mapping_tagged(upper_inode->i_mapping,
+						PAGECACHE_TAG_WRITEBACK)) {
+				iput(inode);
+				goto next;
+			}
+		} else {
+			sync_inode(upper_inode, &wbc_for_sync);
+		}
+
+		spin_lock(&sb->s_inode_wblist_lock);
+		list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
+		spin_unlock(&sb->s_inode_wblist_lock);
+
+next:
+		cond_resched();
+		spin_lock(&ofs->upper_inodes_lock);
+	}
+	spin_unlock(&ofs->upper_inodes_lock);
+	blk_finish_plug(&plug);
+}
+
+/**
+ * ovl_wait_inodes
+ * @sb: The overlayfs super block
+ *
+ * Wait for writeback on the inodes in the s_inodes_wb list;
+ * all the IO that has been issued up to the time this
+ * function is entered is guaranteed to be completed.
+ */
+static void ovl_wait_inodes(struct super_block *sb)
+{
+	LIST_HEAD(sync_wait_list);
+	struct inode *inode;
+	struct inode *upper_inode;
+
+	/*
+	 * An ovl inode in the sb->s_inodes_wb list already holds an inode
+	 * reference; this avoids silent inode dropping while waiting on it.
+	 *
+	 * Splice the sync wait list onto a temporary list to avoid waiting on
+	 * inodes that have started writeback after this point.
+	 */
+	spin_lock(&sb->s_inode_wblist_lock);
+	list_splice_init(&sb->s_inodes_wb, &sync_wait_list);
+
+	while (!list_empty(&sync_wait_list)) {
+		inode = list_first_entry(&sync_wait_list, struct inode,
+							i_wb_list);
+		list_del_init(&inode->i_wb_list);
+		upper_inode = ovl_inode_upper(inode);
+		spin_unlock(&sb->s_inode_wblist_lock);
+
+		if (!mapping_tagged(upper_inode->i_mapping,
+				PAGECACHE_TAG_WRITEBACK)) {
+			goto next;
+		}
+
+		filemap_fdatawait_keep_errors(upper_inode->i_mapping);
+		cond_resched();
+
+next:
+		iput(inode);
+		spin_lock(&sb->s_inode_wblist_lock);
+	}
+	spin_unlock(&sb->s_inode_wblist_lock);
+}
+
+/**
+ * ovl_sync_filesystem
+ * @sb: The overlayfs super block
+ *
+ * Sync underlying dirty inodes in upper filesystem and
+ * wait for completion.
+ */
+static int ovl_sync_filesystem(struct super_block *sb)
+{
+	struct ovl_fs *ofs = sb->s_fs_info;
+	struct super_block *upper_sb = ofs->upper_mnt->mnt_sb;
+	int ret = 0;
+
+	down_read(&upper_sb->s_umount);
+	if (!sb_rdonly(upper_sb)) {
+		/*
+		 * s_sync_lock is used for serializing concurrent
+		 * syncfs operations.
+		 */
+		mutex_lock(&sb->s_sync_lock);
+		ovl_sync_inodes(sb);
+		/* Calling sync_fs with no-wait for better performance. */
+		if (upper_sb->s_op->sync_fs)
+			upper_sb->s_op->sync_fs(upper_sb, 0);
+
+		ovl_wait_inodes(sb);
+		if (upper_sb->s_op->sync_fs)
+			upper_sb->s_op->sync_fs(upper_sb, 1);
+
+		ret = sync_blockdev(upper_sb->s_bdev);
+		mutex_unlock(&sb->s_sync_lock);
+	}
+	up_read(&upper_sb->s_umount);
+	return ret;
+}
+
+/**
+ * ovl_sync_fs
+ * @sb: The overlayfs super block
+ * @wait: Wait for I/O to complete
+ *
+ * Sync real dirty inodes in upper filesystem (if it exists)
+ */
+int ovl_sync_fs(struct super_block *sb, int wait)
+{
+	struct ovl_fs *ofs = sb->s_fs_info;
+
+	if (!ofs->upper_mnt)
+		return 0;
+
+	/*
+	 * If this is a sync(2) call or an emergency sync, all the super blocks
+	 * will be iterated, including upper_sb, so no need to do anything.
+	 *
+	 * If this is a syncfs(2) call, then we do need to call
+	 * sync_filesystem() on upper_sb, but enough if we do it when being
+	 * called with wait == 1.
+	 */
+	if (!wait)
+		return 0;
+
+	return ovl_sync_filesystem(sb);
+}
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index ea005085803f..73ef195d9aef 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -386,13 +386,33 @@ void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect)
 	oi->redirect = redirect;
 }
 
+void ovl_attach_upper_inodes_list(struct inode *inode)
+{
+	struct ovl_fs *ofs = inode->i_sb->s_fs_info;
+
+	spin_lock(&ofs->upper_inodes_lock);
+	list_add(&OVL_I(inode)->upper_inodes_list, &ofs->upper_inodes);
+	spin_unlock(&ofs->upper_inodes_lock);
+}
+
+void ovl_detach_upper_inodes_list(struct inode *inode)
+{
+	struct ovl_fs *ofs = inode->i_sb->s_fs_info;
+
+	spin_lock(&ofs->upper_inodes_lock);
+	list_del_init(&OVL_I(inode)->upper_inodes_list);
+	spin_unlock(&ofs->upper_inodes_lock);
+}
+
 void ovl_inode_init(struct inode *inode, struct dentry *upperdentry,
 		    struct dentry *lowerdentry, struct dentry *lowerdata)
 {
 	struct inode *realinode = d_inode(upperdentry ?: lowerdentry);
 
-	if (upperdentry)
+	if (upperdentry) {
 		OVL_I(inode)->__upperdentry = upperdentry;
+		ovl_attach_upper_inodes_list(inode);
+	}
 	if (lowerdentry)
 		OVL_I(inode)->lower = igrab(d_inode(lowerdentry));
 	if (lowerdata)
@@ -421,6 +441,7 @@ void ovl_inode_update(struct inode *inode, struct dentry *upperdentry)
 		inode->i_private = upperinode;
 		__insert_inode_hash(inode, (unsigned long) upperinode);
 	}
+	ovl_attach_upper_inodes_list(inode);
 }
 
 static void ovl_dentry_version_inc(struct dentry *dentry, bool impurity)
-- 
2.21.1




^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-02-10  3:10 [PATCH v11] ovl: Improving syncfs efficiency Chengguang Xu
@ 2020-03-17  4:41 ` Chengguang Xu
  2020-03-17  4:49   ` Chengguang Xu
  2020-04-15 19:19 ` [PATCH " Miklos Szeredi
  1 sibling, 1 reply; 14+ messages in thread
From: Chengguang Xu @ 2020-03-17  4:41 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: miklos, amir73il, jack, linux-unionfs

 ---- On Mon, 2020-02-10 11:10:09 Chengguang Xu <cgxu519@mykernel.net> wrote ----
 > [...]
Hi Miklos,

Have you had time to review this patch?
I tested the xfstests overlay group (and generic/427, which is the test case for syncfs) and a heavy untar + syncfs workload, and found no problems.

Thanks,
cgxu

 > Changes since v10:
 > - Add special handling in ovl_evict_inode() for inode eviction trigered
 > by kswapd memory reclamation(with PF_MEMALLOC flag).
 > - Rebase on current latest overlayfs-next tree.
 > - Slightly modify license information.
 > 
 > Changes since v9:
 > - Rebase on current latest overlayfs-next tree.
 > - Calling clear_inode() regardless of having upper inode.
 > 
 > Changes since v8:
 > - Remove unnecessary blk_start_plug() call in ovl_sync_inodes().
 > 
 > Changes since v7:
 > - Check OVL_EVICT_PENDING flag instead of I_FREEING to recognize
 > inode which is in eviction process.
 > - Do not move ovl_inode back to ofs->upper_inodes when inode is in
 > eviction process.
 > - Delete unnecessary memory barrier in ovl_evict_inode().
 > 
 > Changes since v6:
 > - Take/release sb->s_sync_lock bofore/after sync_fs to serialize
 > concurrent syncfs.
 > - Change list iterating method to improve code readability.
 > - Fix typo in commit log and comments.
 > - Add header comment to sync.c.
 > 
 > Changes since v5:
 > - Move sync related functions to new file sync.c.
 > - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
 > - If ovl inode flag is I_WILL_FREE or I_FREEING then syncfs
 > will wait until OVL_EVICT_PENDING to be cleared.
 > - Move inode sync operation into evict_inode from drop_inode.
 > - Call upper_sb->sync_fs with no-wait after sync dirty upper
 > inodes.
 > - Hold s_inode_wblist_lock until deleting item from list.
 > - Remove write_inode fuction because no more used.
 > 
 > Changes since v4:
 > - Add syncing operation and deleting from upper_inodes list
 > during ovl inode destruction.
 > - Export symbol of inode_wait_for_writeback() for waiting
 > writeback on ovl inode.
 > - Reuse I_SYNC flag to avoid race conditions between syncfs
 > and drop_inode.
 > 
 > Changes since v3:
 > - Introduce upper_inode list to organize ovl inode which has upper
 >   inode.
 > - Avoid iput operation inside spin lock.
 > - Change list iteration method for safety.
 > - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
 > - Modify commit log and add more comments to the code.
 > 
 > Changes since v2:
 > - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
 > overlayfs.
 > - Factoring out sync operation to different helpers for easy
 >   understanding.
 > 
 > Changes since v1:
 > - If upper filesystem is readonly mode then skip synchronization.
 > - Introduce global wait list to replace temporary wait list for
 > concurrent synchronization.
 > 
 >  fs/overlayfs/Makefile    |   2 +-
 >  fs/overlayfs/overlayfs.h |   9 ++
 >  fs/overlayfs/ovl_entry.h |   5 +
 >  fs/overlayfs/super.c     |  44 ++----
 >  fs/overlayfs/sync.c      | 331 +++++++++++++++++++++++++++++++++++++++
 >  fs/overlayfs/util.c      |  23 ++-
 >  6 files changed, 381 insertions(+), 33 deletions(-)
 >  create mode 100644 fs/overlayfs/sync.c
 > 
 > diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
 > index 9164c585eb2f..0c4e53fcf4ef 100644
 > --- a/fs/overlayfs/Makefile
 > +++ b/fs/overlayfs/Makefile
 > @@ -6,4 +6,4 @@
 >  obj-$(CONFIG_OVERLAY_FS) += overlay.o
 >  
 >  overlay-objs := super.o namei.o util.o inode.o file.o dir.o readdir.o \
 > -        copy_up.o export.o
 > +        copy_up.o export.o sync.o
 > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
 > index 3623d28aa4fa..3cdb06e206a3 100644
 > --- a/fs/overlayfs/overlayfs.h
 > +++ b/fs/overlayfs/overlayfs.h
 > @@ -40,6 +40,10 @@ enum ovl_inode_flag {
 >      OVL_UPPERDATA,
 >      /* Inode number will remain constant over copy up. */
 >      OVL_CONST_INO,
 > +    /* Set when ovl inode is about to be freed */
 > +    OVL_EVICT_PENDING,
 > +    /* Set when sync upper inode in workqueue work */
 > +    OVL_WRITE_INODE_PENDING,
 >  };
 >  
 >  enum ovl_entry_flag {
 > @@ -295,6 +299,7 @@ bool ovl_is_metacopy_dentry(struct dentry *dentry);
 >  char *ovl_get_redirect_xattr(struct dentry *dentry, int padding);
 >  ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 >               size_t padding);
 > +void ovl_detach_upper_inodes_list(struct inode *inode);
 >  
 >  static inline bool ovl_is_impuredir(struct dentry *dentry)
 >  {
 > @@ -466,3 +471,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 >  
 >  /* export.c */
 >  extern const struct export_operations ovl_export_operations;
 > +
 > +/* sync.c */
 > +void ovl_evict_inode(struct inode *inode);
 > +int ovl_sync_fs(struct super_block *sb, int wait);
 > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
 > index 89015ea822e7..42e092e8cd61 100644
 > --- a/fs/overlayfs/ovl_entry.h
 > +++ b/fs/overlayfs/ovl_entry.h
 > @@ -75,6 +75,9 @@ struct ovl_fs {
 >      struct inode *indexdir_trap;
 >      /* -1: disabled, 0: same fs, 1..32: number of unused ino bits */
 >      int xino_mode;
 > +    /* Upper inode list and lock */
 > +    spinlock_t upper_inodes_lock;
 > +    struct list_head upper_inodes;
 >  };
 >  
 >  static inline struct ovl_fs *OVL_FS(struct super_block *sb)
 > @@ -115,6 +118,8 @@ struct ovl_inode {
 >  
 >      /* synchronize copy up and more */
 >      struct mutex lock;
 > +    /* Upper inodes list */
 > +    struct list_head upper_inodes_list;
 >  };
 >  
 >  static inline struct ovl_inode *OVL_I(struct inode *inode)
 > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
 > index 319fe0d355b0..3e6871e1a7ba 100644
 > --- a/fs/overlayfs/super.c
 > +++ b/fs/overlayfs/super.c
 > @@ -183,6 +183,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
 >      oi->lower = NULL;
 >      oi->lowerdata = NULL;
 >      mutex_init(&oi->lock);
 > +    INIT_LIST_HEAD(&oi->upper_inodes_list);
 >  
 >      return &oi->vfs_inode;
 >  }
 > @@ -249,36 +250,6 @@ static void ovl_put_super(struct super_block *sb)
 >      ovl_free_fs(ofs);
 >  }
 >  
 > -/* Sync real dirty inodes in upper filesystem (if it exists) */
 > -static int ovl_sync_fs(struct super_block *sb, int wait)
 > -{
 > -    struct ovl_fs *ofs = sb->s_fs_info;
 > -    struct super_block *upper_sb;
 > -    int ret;
 > -
 > -    if (!ofs->upper_mnt)
 > -        return 0;
 > -
 > -    /*
 > -     * If this is a sync(2) call or an emergency sync, all the super blocks
 > -     * will be iterated, including upper_sb, so no need to do anything.
 > -     *
 > -     * If this is a syncfs(2) call, then we do need to call
 > -     * sync_filesystem() on upper_sb, but enough if we do it when being
 > -     * called with wait == 1.
 > -     */
 > -    if (!wait)
 > -        return 0;
 > -
 > -    upper_sb = ofs->upper_mnt->mnt_sb;
 > -
 > -    down_read(&upper_sb->s_umount);
 > -    ret = sync_filesystem(upper_sb);
 > -    up_read(&upper_sb->s_umount);
 > -
 > -    return ret;
 > -}
 > -
 >  /**
 >   * ovl_statfs
 >   * @sb: The overlayfs super block
 > @@ -376,11 +347,19 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 >      return 0;
 >  }
 >  
 > +static int ovl_drop_inode(struct inode *inode)
 > +{
 > +    if (ovl_inode_upper(inode))
 > +        ovl_set_flag(OVL_EVICT_PENDING, inode);
 > +    return 1;
 > +}
 > +
 >  static const struct super_operations ovl_super_operations = {
 >      .alloc_inode    = ovl_alloc_inode,
 >      .free_inode    = ovl_free_inode,
 >      .destroy_inode    = ovl_destroy_inode,
 > -    .drop_inode    = generic_delete_inode,
 > +    .drop_inode    = ovl_drop_inode,
 > +    .evict_inode    = ovl_evict_inode,
 >      .put_super    = ovl_put_super,
 >      .sync_fs    = ovl_sync_fs,
 >      .statfs        = ovl_statfs,
 > @@ -1601,6 +1580,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 >      if (!ofs)
 >          goto out;
 >  
 > +    spin_lock_init(&ofs->upper_inodes_lock);
 > +    INIT_LIST_HEAD(&ofs->upper_inodes);
 > +
 >      ofs->creator_cred = cred = prepare_creds();
 >      if (!cred)
 >          goto out_err;
 > diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
 > new file mode 100644
 > index 000000000000..aecd312ec851
 > --- /dev/null
 > +++ b/fs/overlayfs/sync.c
 > @@ -0,0 +1,331 @@
 > +// SPDX-License-Identifier: GPL-2.0-only
 > +/*
 > + * Copyright (C) 2020 All Rights Reserved.
 > + * Author Chengguang Xu <cgxu519@mykernel.net>
 > + */
 > +
 > +#include <linux/fs.h>
 > +#include <linux/xattr.h>
 > +#include <linux/mount.h>
 > +#include <linux/writeback.h>
 > +#include <linux/blkdev.h>
 > +#include "overlayfs.h"
 > +
 > +/**
 > + * upper_inodes list is used for organizing potential target of syncfs,
 > + * ovl inode which has upper inode will be added to this list while
 > + * initializing or updating and will be deleted from this list while
 > + * evicting.
 > + *
 > + * Introduce ovl_inode flag "OVL_EVICT_PENDING" to indicate the ovl inode
 > + * is in eviction process, syncfs(actually ovl_sync_inodes()) will wait on
 > + * evicting inode until the IO to be completed in evict_inode().
 > + *
 > + * inode state/ovl_inode flags cycle in syncfs context will be as below.
 > + * OVL_EVICT_PENDING (ovl_inode->flags) is only marked when inode having
 > + * upper inode.
 > + *
 > + * (allocation)
 > + *   |
 > + * I_NEW (inode->i_state)
 > + *   |
 > + * NONE  (inode->i_state)
 > + *   |
 > + * OVL_EVICT_PENDING (ovl_inode->flags)
 > + *   |
 > + * I_FREEING (inode->i_state) | OVL_EVICT_PENDING (ovl_inode->flags)
 > + *   |
 > + * I_FREEING (inode->i_state) | I_CLEAR (inode->i_state)
 > + *   |
 > + * (destruction)
 > + *
 > + *
 > + * There are five locks in in syncfs contexts:
 > + *
 > + * upper_sb->s_umount(semaphore)    : avoiding r/o to r/w or vice versa
 > + * sb->s_sync_lock(mutex)           : avoiding concorrent syncfs running
 > + * ofs->upper_inodes_lock(spinlock) : protecting upper_inodes list
 > + * sb->s_inode_wblist_lock(spinlock): protecting s_inodes_wb(sync waiting) list
 > + * inode->i_lock(spinlock)          : protecting inode fields
 > + *
 > + * Lock dependencies in syncfs contexts:
 > + *
 > + * upper_sb->s_umount
 > + *    sb->s_sync_lock
 > + *        ofs->upper_inodes_lock
 > + *            inode->i_lock
 > + *
 > + * upper_sb->s_umount
 > + *    sb->s_sync_lock
 > + *        sb->s_inode_wblist_lock
 > + *
 > + */
 > +
 > +struct ovl_write_inode_work {
 > +    struct work_struct work;
 > +    struct inode *inode;
 > +};
 > +
 > +static void ovl_write_inode_work_fn(struct work_struct *work)
 > +{
 > +    struct ovl_write_inode_work *ovl_wiw;
 > +    struct inode *inode;
 > +
 > +    ovl_wiw = container_of(work, struct ovl_write_inode_work, work);
 > +    inode = ovl_wiw->inode;
 > +    write_inode_now(ovl_inode_upper(inode), 1);
 > +
 > +    spin_lock(&inode->i_lock);
 > +    ovl_clear_flag(OVL_WRITE_INODE_PENDING, inode);
 > +    wake_up_bit(&OVL_I(inode)->flags, OVL_WRITE_INODE_PENDING);
 > +    spin_unlock(&inode->i_lock);
 > +}
 > +
 > +void ovl_evict_inode(struct inode *inode)
 > +{
 > +    struct ovl_inode *oi = OVL_I(inode);
 > +    struct ovl_write_inode_work ovl_wiw;
 > +    DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
 > +    wait_queue_head_t *wqh;
 > +
 > +    if (ovl_inode_upper(inode)) {
 > +        if (current->flags & PF_MEMALLOC) {
 > +            spin_lock(&inode->i_lock);
 > +            ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
 > +            wqh = bit_waitqueue(&oi->flags,
 > +                    OVL_WRITE_INODE_PENDING);
 > +            prepare_to_wait(wqh, &wait.wq_entry,
 > +                    TASK_UNINTERRUPTIBLE);
 > +            spin_unlock(&inode->i_lock);
 > +
 > +            ovl_wiw.inode = inode;
 > +            INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
 > +            schedule_work(&ovl_wiw.work);
 > +
 > +            schedule();
 > +            finish_wait(wqh, &wait.wq_entry);
 > +        } else {
 > +            write_inode_now(ovl_inode_upper(inode), 1);
 > +        }
 > +        ovl_detach_upper_inodes_list(inode);
 > +
 > +        /*
 > +         * ovl_sync_inodes() may be waiting for the
 > +         * OVL_EVICT_PENDING flag to be cleared.
 > +         */
 > +        spin_lock(&inode->i_lock);
 > +        ovl_clear_flag(OVL_EVICT_PENDING, inode);
 > +        wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
 > +        spin_unlock(&inode->i_lock);
 > +    }
 > +    clear_inode(inode);
 > +}
 > +
 > +void ovl_wait_evict_pending(struct inode *inode)
 > +{
 > +    struct ovl_fs *ofs = inode->i_sb->s_fs_info;
 > +    struct ovl_inode *oi = OVL_I(inode);
 > +    DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
 > +    wait_queue_head_t *wqh;
 > +
 > +    wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
 > +    prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 > +    spin_unlock(&inode->i_lock);
 > +    spin_unlock(&ofs->upper_inodes_lock);
 > +    schedule();
 > +    finish_wait(wqh, &wait.wq_entry);
 > +}
 > +
 > +/**
 > + * ovl_sync_inodes
 > + * @sb: The overlayfs super block
 > + *
 > + * The upper_inodes list organizes ovl inodes that have an upper inode;
 > + * iterate the list looking for dirty upper inodes and sync them.
 > + *
 > + * When an inode sync is started, the inode is added to a wait list
 > + * explicitly; to save memory we reuse inode->i_wb_list and
 > + * sb->s_inodes_wb, so those fields have slightly different meanings in
 > + * overlayfs.
 > + */
 > +static void ovl_sync_inodes(struct super_block *sb)
 > +{
 > +    struct ovl_fs *ofs = sb->s_fs_info;
 > +    struct ovl_inode *oi;
 > +    struct inode *inode;
 > +    struct inode *upper_inode;
 > +    struct blk_plug plug;
 > +    LIST_HEAD(sync_tmp_list);
 > +
 > +    struct writeback_control wbc_for_sync = {
 > +        .sync_mode        = WB_SYNC_ALL,
 > +        .for_sync        = 1,
 > +        .range_start        = 0,
 > +        .range_end        = LLONG_MAX,
 > +        .nr_to_write        = LONG_MAX,
 > +    };
 > +
 > +    blk_start_plug(&plug);
 > +    spin_lock(&ofs->upper_inodes_lock);
 > +    list_splice_init(&ofs->upper_inodes, &sync_tmp_list);
 > +
 > +    while (!list_empty(&sync_tmp_list)) {
 > +        oi = list_first_entry(&sync_tmp_list, struct ovl_inode,
 > +                        upper_inodes_list);
 > +        inode = &oi->vfs_inode;
 > +        spin_lock(&inode->i_lock);
 > +
 > +        if (inode->i_state & I_NEW) {
 > +            list_move_tail(&oi->upper_inodes_list,
 > +                    &ofs->upper_inodes);
 > +            spin_unlock(&inode->i_lock);
 > +            continue;
 > +        }
 > +
 > +        /*
 > +         * If the OVL_EVICT_PENDING flag is set, the sync
 > +         * operation is left to ovl_evict_inode(), so wait
 > +         * here until the flag is cleared.
 > +         */
 > +        if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
 > +            ovl_wait_evict_pending(inode);
 > +            goto next;
 > +        }
 > +
 > +        list_move_tail(&oi->upper_inodes_list,
 > +                &ofs->upper_inodes);
 > +        ihold(inode);
 > +        upper_inode = ovl_inode_upper(inode);
 > +        spin_unlock(&inode->i_lock);
 > +        spin_unlock(&ofs->upper_inodes_lock);
 > +
 > +        if (!(upper_inode->i_state & I_DIRTY_ALL)) {
 > +            if (!mapping_tagged(upper_inode->i_mapping,
 > +                        PAGECACHE_TAG_WRITEBACK)) {
 > +                iput(inode);
 > +                goto next;
 > +            }
 > +        } else {
 > +            sync_inode(upper_inode, &wbc_for_sync);
 > +        }
 > +
 > +        spin_lock(&sb->s_inode_wblist_lock);
 > +        list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
 > +        spin_unlock(&sb->s_inode_wblist_lock);
 > +
 > +next:
 > +        cond_resched();
 > +        spin_lock(&ofs->upper_inodes_lock);
 > +    }
 > +    spin_unlock(&ofs->upper_inodes_lock);
 > +    blk_finish_plug(&plug);
 > +}
 > +
 > +/**
 > + * ovl_wait_inodes
 > + * @sb: The overlayfs super block
 > + *
 > + * Wait for writeback on the inodes in the s_inodes_wb list;
 > + * all IO issued before this function was entered is
 > + * guaranteed to have completed on return.
 > + */
 > +static void ovl_wait_inodes(struct super_block *sb)
 > +{
 > +    LIST_HEAD(sync_wait_list);
 > +    struct inode *inode;
 > +    struct inode *upper_inode;
 > +
 > +    /*
 > +     * Each ovl inode on the sb->s_inodes_wb list already holds an inode
 > +     * reference, which prevents the inode from being silently dropped
 > +     * while we wait on it.
 > +     *
 > +     * Splice the sync wait list onto a temporary list to avoid waiting on
 > +     * inodes that have started writeback after this point.
 > +     */
 > +    spin_lock(&sb->s_inode_wblist_lock);
 > +    list_splice_init(&sb->s_inodes_wb, &sync_wait_list);
 > +
 > +    while (!list_empty(&sync_wait_list)) {
 > +        inode = list_first_entry(&sync_wait_list, struct inode,
 > +                            i_wb_list);
 > +        list_del_init(&inode->i_wb_list);
 > +        upper_inode = ovl_inode_upper(inode);
 > +        spin_unlock(&sb->s_inode_wblist_lock);
 > +
 > +        if (!mapping_tagged(upper_inode->i_mapping,
 > +                PAGECACHE_TAG_WRITEBACK)) {
 > +            goto next;
 > +        }
 > +
 > +        filemap_fdatawait_keep_errors(upper_inode->i_mapping);
 > +        cond_resched();
 > +
 > +next:
 > +        iput(inode);
 > +        spin_lock(&sb->s_inode_wblist_lock);
 > +    }
 > +    spin_unlock(&sb->s_inode_wblist_lock);
 > +}
 > +
 > +/**
 > + * ovl_sync_filesystem
 > + * @sb: The overlayfs super block
 > + *
 > + * Sync underlying dirty inodes in upper filesystem and
 > + * wait for completion.
 > + */
 > +static int ovl_sync_filesystem(struct super_block *sb)
 > +{
 > +    struct ovl_fs *ofs = sb->s_fs_info;
 > +    struct super_block *upper_sb = ofs->upper_mnt->mnt_sb;
 > +    int ret = 0;
 > +
 > +    down_read(&upper_sb->s_umount);
 > +    if (!sb_rdonly(upper_sb)) {
 > +        /*
 > +         * s_sync_lock is used for serializing concurrent
 > +         * syncfs operations.
 > +         */
 > +        mutex_lock(&sb->s_sync_lock);
 > +        ovl_sync_inodes(sb);
 > +        /* Calling sync_fs with no-wait for better performance. */
 > +        if (upper_sb->s_op->sync_fs)
 > +            upper_sb->s_op->sync_fs(upper_sb, 0);
 > +
 > +        ovl_wait_inodes(sb);
 > +        if (upper_sb->s_op->sync_fs)
 > +            upper_sb->s_op->sync_fs(upper_sb, 1);
 > +
 > +        ret = sync_blockdev(upper_sb->s_bdev);
 > +        mutex_unlock(&sb->s_sync_lock);
 > +    }
 > +    up_read(&upper_sb->s_umount);
 > +    return ret;
 > +}
 > +
 > +/**
 > + * ovl_sync_fs
 > + * @sb: The overlayfs super block
 > + * @wait: Wait for I/O to complete
 > + *
 > + * Sync real dirty inodes in upper filesystem (if it exists)
 > + */
 > +int ovl_sync_fs(struct super_block *sb, int wait)
 > +{
 > +    struct ovl_fs *ofs = sb->s_fs_info;
 > +
 > +    if (!ofs->upper_mnt)
 > +        return 0;
 > +
 > +    /*
 > +     * If this is a sync(2) call or an emergency sync, all the super blocks
 > +     * will be iterated, including upper_sb, so no need to do anything.
 > +     *
 > +     * If this is a syncfs(2) call, then we do need to call
 > +     * sync_filesystem() on upper_sb, but it is enough to do that
 > +     * only when called with wait == 1.
 > +     */
 > +    if (!wait)
 > +        return 0;
 > +
 > +    return ovl_sync_filesystem(sb);
 > +}
 > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
 > index ea005085803f..73ef195d9aef 100644
 > --- a/fs/overlayfs/util.c
 > +++ b/fs/overlayfs/util.c
 > @@ -386,13 +386,33 @@ void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect)
 >      oi->redirect = redirect;
 >  }
 >  
 > +void ovl_attach_upper_inodes_list(struct inode *inode)
 > +{
 > +    struct ovl_fs *ofs = inode->i_sb->s_fs_info;
 > +
 > +    spin_lock(&ofs->upper_inodes_lock);
 > +    list_add(&OVL_I(inode)->upper_inodes_list, &ofs->upper_inodes);
 > +    spin_unlock(&ofs->upper_inodes_lock);
 > +}
 > +
 > +void ovl_detach_upper_inodes_list(struct inode *inode)
 > +{
 > +    struct ovl_fs *ofs = inode->i_sb->s_fs_info;
 > +
 > +    spin_lock(&ofs->upper_inodes_lock);
 > +    list_del_init(&OVL_I(inode)->upper_inodes_list);
 > +    spin_unlock(&ofs->upper_inodes_lock);
 > +}
 > +
 >  void ovl_inode_init(struct inode *inode, struct dentry *upperdentry,
 >              struct dentry *lowerdentry, struct dentry *lowerdata)
 >  {
 >      struct inode *realinode = d_inode(upperdentry ?: lowerdentry);
 >  
 > -    if (upperdentry)
 > +    if (upperdentry) {
 >          OVL_I(inode)->__upperdentry = upperdentry;
 > +        ovl_attach_upper_inodes_list(inode);
 > +    }
 >      if (lowerdentry)
 >          OVL_I(inode)->lower = igrab(d_inode(lowerdentry));
 >      if (lowerdata)
 > @@ -421,6 +441,7 @@ void ovl_inode_update(struct inode *inode, struct dentry *upperdentry)
 >          inode->i_private = upperinode;
 >          __insert_inode_hash(inode, (unsigned long) upperinode);
 >      }
 > +    ovl_attach_upper_inodes_list(inode);
 >  }
 >  
 >  static void ovl_dentry_version_inc(struct dentry *dentry, bool impurity)
 > -- 
 > 2.21.1
 > 
 > 
 >
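For context, the syscall whose overlayfs path this patch reworks is syncfs(2); a minimal userspace sketch of the caller's side (the helper name and path argument are placeholders, any directory on the target overlay mount would do):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Open a directory on the target mount and call syncfs(2) on it.
 * Unlike sync(2), which iterates all super blocks, syncfs() flushes
 * only the filesystem backing the given fd, which is why per-overlay
 * granularity matters for the container use case described above.
 */
int sync_mount(const char *path)
{
	int fd = open(path, O_RDONLY | O_DIRECTORY);
	int ret;

	if (fd < 0)
		return -1;
	ret = syncfs(fd);
	close(fd);
	return ret;
}
```

A container manager would call something like sync_mount() on the overlay merged directory instead of issuing a global sync(2).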

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-03-17  4:41 ` Re: [PATCH " Chengguang Xu
@ 2020-03-17  4:49   ` Chengguang Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Chengguang Xu @ 2020-03-17  4:49 UTC (permalink / raw)
  To: cgxu519; +Cc: miklos, amir73il, jack, linux-unionfs

 ---- On Tue, 2020-03-17 12:41:17, Chengguang Xu <cgxu519@mykernel.net> wrote ----
 >  ---- On Mon, 2020-02-10 11:10:09, Chengguang Xu <cgxu519@mykernel.net> wrote ----
 >  > [...]
 > 
 > Hi Miklos,
 > 
 > Have you had time to review this patch?
 > I tested the xfstests overlay part (and generic/427, which is the test case for syncfs) plus a heavy untar + syncfs test, and found no problems.
 > 

Additionally, Jack and Amir have already reviewed up to the v10 version.

Thanks,
cgxu
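For readers new to the pattern: ovl_sync_inodes() steals the whole upper_inodes list onto a private head and walks it, moving each entry back to the main list before processing it, so concurrent walkers always see a consistent list. A simplified userspace model of that traversal (all names are illustrative, not the kernel API; no locking, and a plain int stands in for the real dirty state):

```c
#include <stddef.h>

/* Minimal intrusive doubly-linked list, modeled on <linux/list.h>. */
struct list_head { struct list_head *prev, *next; };

void list_init(struct list_head *h) { h->prev = h->next = h; }

void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
}

void list_del(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	n->prev = n->next = n;
}

struct fake_inode {
	int dirty;
	struct list_head node;
};

#define node_to_inode(p) \
	((struct fake_inode *)((char *)(p) - offsetof(struct fake_inode, node)))

/*
 * Splice everything onto a private sync_tmp_list in O(1), then iterate:
 * each entry is moved back to the main list before being "synced",
 * mirroring the list_splice_init()/list_move_tail() dance in the patch.
 * Returns how many inodes were dirty.
 */
int sync_upper_inodes(struct list_head *upper_inodes)
{
	struct list_head sync_tmp_list;
	int synced = 0;

	list_init(&sync_tmp_list);
	if (upper_inodes->next != upper_inodes) {
		/* steal the whole list (list_splice_init equivalent) */
		sync_tmp_list.next = upper_inodes->next;
		sync_tmp_list.prev = upper_inodes->prev;
		sync_tmp_list.next->prev = &sync_tmp_list;
		sync_tmp_list.prev->next = &sync_tmp_list;
		list_init(upper_inodes);
	}

	while (sync_tmp_list.next != &sync_tmp_list) {
		struct fake_inode *i = node_to_inode(sync_tmp_list.next);

		list_del(&i->node);
		list_add_tail(&i->node, upper_inodes); /* move back first */
		if (i->dirty) {
			i->dirty = 0; /* stand-in for sync_inode() */
			synced++;
		}
	}
	return synced;
}
```

The point of moving each entry back before processing is that the main list is never left half-empty: the kernel version drops and retakes upper_inodes_lock around the sync, and other walkers must still find every inode.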

 >  > Changes since v10:
 >  > - Add special handling in ovl_evict_inode() for inode eviction triggered
 >  > by kswapd memory reclamation (with the PF_MEMALLOC flag).
 >  > - Rebase on current latest overlayfs-next tree.
 >  > - Slightly modify license information.
 >  > 
 >  > Changes since v9:
 >  > - Rebase on current latest overlayfs-next tree.
 >  > - Calling clear_inode() regardless of having upper inode.
 >  > 
 >  > Changes since v8:
 >  > - Remove unnecessary blk_start_plug() call in ovl_sync_inodes().
 >  > 
 >  > Changes since v7:
 >  > - Check the OVL_EVICT_PENDING flag instead of I_FREEING to recognize
 >  > an inode which is in the eviction process.
 >  > - Do not move ovl_inode back to ofs->upper_inodes when inode is in
 >  > eviction process.
 >  > - Delete unnecessary memory barrier in ovl_evict_inode().
 >  > 
 >  > Changes since v6:
 >  > - Take/release sb->s_sync_lock before/after sync_fs to serialize
 >  > concurrent syncfs.
 >  > - Change list iterating method to improve code readability.
 >  > - Fix typo in commit log and comments.
 >  > - Add header comment to sync.c.
 >  > 
 >  > Changes since v5:
 >  > - Move sync related functions to new file sync.c.
 >  > - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
 >  > - If the ovl inode flag is I_WILL_FREE or I_FREEING then syncfs
 >  > will wait for OVL_EVICT_PENDING to be cleared.
 >  > - Move inode sync operation into evict_inode from drop_inode.
 >  > - Call upper_sb->sync_fs with no-wait after sync dirty upper
 >  > inodes.
 >  > - Hold s_inode_wblist_lock until deleting item from list.
 >  > - Remove the write_inode function because it is no longer used.
 >  > 
 >  > Changes since v4:
 >  > - Add syncing operation and deleting from upper_inodes list
 >  > during ovl inode destruction.
 >  > - Export symbol of inode_wait_for_writeback() for waiting
 >  > writeback on ovl inode.
 >  > - Reuse I_SYNC flag to avoid race conditions between syncfs
 >  > and drop_inode.
 >  > 
 >  > Changes since v3:
 >  > - Introduce upper_inode list to organize ovl inode which has upper
 >  >   inode.
 >  > - Avoid iput operation inside spin lock.
 >  > - Change list iteration method for safety.
 >  > - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
 >  > - Modify commit log and add more comments to the code.
 >  > 
 >  > Changes since v2:
 >  > - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
 >  > overlayfs.
 >  > - Factoring out sync operation to different helpers for easy
 >  >   understanding.
 >  > 
 >  > Changes since v1:
 >  > - If upper filesystem is readonly mode then skip synchronization.
 >  > - Introduce global wait list to replace temporary wait list for
 >  > concurrent synchronization.
 >  > 
 >  >  fs/overlayfs/Makefile    |   2 +-
 >  >  fs/overlayfs/overlayfs.h |   9 ++
 >  >  fs/overlayfs/ovl_entry.h |   5 +
 >  >  fs/overlayfs/super.c     |  44 ++----
 >  >  fs/overlayfs/sync.c      | 331 +++++++++++++++++++++++++++++++++++++++
 >  >  fs/overlayfs/util.c      |  23 ++-
 >  >  6 files changed, 381 insertions(+), 33 deletions(-)
 >  >  create mode 100644 fs/overlayfs/sync.c
 >  > 
 >  > diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
 >  > index 9164c585eb2f..0c4e53fcf4ef 100644
 >  > --- a/fs/overlayfs/Makefile
 >  > +++ b/fs/overlayfs/Makefile
 >  > @@ -6,4 +6,4 @@
 >  >  obj-$(CONFIG_OVERLAY_FS) += overlay.o
 >  >  
 >  >  overlay-objs := super.o namei.o util.o inode.o file.o dir.o readdir.o \
 >  > -        copy_up.o export.o
 >  > +        copy_up.o export.o sync.o
 >  > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
 >  > index 3623d28aa4fa..3cdb06e206a3 100644
 >  > --- a/fs/overlayfs/overlayfs.h
 >  > +++ b/fs/overlayfs/overlayfs.h
 >  > @@ -40,6 +40,10 @@ enum ovl_inode_flag {
 >  >      OVL_UPPERDATA,
 >  >      /* Inode number will remain constant over copy up. */
 >  >      OVL_CONST_INO,
 >  > +    /* Set when ovl inode is about to be freed */
 >  > +    OVL_EVICT_PENDING,
 >  > +    /* Set when sync upper inode in workqueue work */
 >  > +    OVL_WRITE_INODE_PENDING,
 >  >  };
 >  >  
 >  >  enum ovl_entry_flag {
 >  > @@ -295,6 +299,7 @@ bool ovl_is_metacopy_dentry(struct dentry *dentry);
 >  >  char *ovl_get_redirect_xattr(struct dentry *dentry, int padding);
 >  >  ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 >  >               size_t padding);
 >  > +void ovl_detach_upper_inodes_list(struct inode *inode);
 >  >  
 >  >  static inline bool ovl_is_impuredir(struct dentry *dentry)
 >  >  {
 >  > @@ -466,3 +471,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 >  >  
 >  >  /* export.c */
 >  >  extern const struct export_operations ovl_export_operations;
 >  > +
 >  > +/* sync.c */
 >  > +void ovl_evict_inode(struct inode *inode);
 >  > +int ovl_sync_fs(struct super_block *sb, int wait);
 >  > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
 >  > index 89015ea822e7..42e092e8cd61 100644
 >  > --- a/fs/overlayfs/ovl_entry.h
 >  > +++ b/fs/overlayfs/ovl_entry.h
 >  > @@ -75,6 +75,9 @@ struct ovl_fs {
 >  >      struct inode *indexdir_trap;
 >  >      /* -1: disabled, 0: same fs, 1..32: number of unused ino bits */
 >  >      int xino_mode;
 >  > +    /* Upper inode list and lock */
 >  > +    spinlock_t upper_inodes_lock;
 >  > +    struct list_head upper_inodes;
 >  >  };
 >  >  
 >  >  static inline struct ovl_fs *OVL_FS(struct super_block *sb)
 >  > @@ -115,6 +118,8 @@ struct ovl_inode {
 >  >  
 >  >      /* synchronize copy up and more */
 >  >      struct mutex lock;
 >  > +    /* Upper inodes list */
 >  > +    struct list_head upper_inodes_list;
 >  >  };
 >  >  
 >  >  static inline struct ovl_inode *OVL_I(struct inode *inode)
 >  > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
 >  > index 319fe0d355b0..3e6871e1a7ba 100644
 >  > --- a/fs/overlayfs/super.c
 >  > +++ b/fs/overlayfs/super.c
 >  > @@ -183,6 +183,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
 >  >      oi->lower = NULL;
 >  >      oi->lowerdata = NULL;
 >  >      mutex_init(&oi->lock);
 >  > +    INIT_LIST_HEAD(&oi->upper_inodes_list);
 >  >  
 >  >      return &oi->vfs_inode;
 >  >  }
 >  > @@ -249,36 +250,6 @@ static void ovl_put_super(struct super_block *sb)
 >  >      ovl_free_fs(ofs);
 >  >  }
 >  >  
 >  > -/* Sync real dirty inodes in upper filesystem (if it exists) */
 >  > -static int ovl_sync_fs(struct super_block *sb, int wait)
 >  > -{
 >  > -    struct ovl_fs *ofs = sb->s_fs_info;
 >  > -    struct super_block *upper_sb;
 >  > -    int ret;
 >  > -
 >  > -    if (!ofs->upper_mnt)
 >  > -        return 0;
 >  > -
 >  > -    /*
 >  > -     * If this is a sync(2) call or an emergency sync, all the super blocks
 >  > -     * will be iterated, including upper_sb, so no need to do anything.
 >  > -     *
 >  > -     * If this is a syncfs(2) call, then we do need to call
 >  > -     * sync_filesystem() on upper_sb, but enough if we do it when being
 >  > -     * called with wait == 1.
 >  > -     */
 >  > -    if (!wait)
 >  > -        return 0;
 >  > -
 >  > -    upper_sb = ofs->upper_mnt->mnt_sb;
 >  > -
 >  > -    down_read(&upper_sb->s_umount);
 >  > -    ret = sync_filesystem(upper_sb);
 >  > -    up_read(&upper_sb->s_umount);
 >  > -
 >  > -    return ret;
 >  > -}
 >  > -
 >  >  /**
 >  >   * ovl_statfs
 >  >   * @sb: The overlayfs super block
 >  > @@ -376,11 +347,19 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 >  >      return 0;
 >  >  }
 >  >  
 >  > +static int ovl_drop_inode(struct inode *inode)
 >  > +{
 >  > +    if (ovl_inode_upper(inode))
 >  > +        ovl_set_flag(OVL_EVICT_PENDING, inode);
 >  > +    return 1;
 >  > +}
 >  > +
 >  >  static const struct super_operations ovl_super_operations = {
 >  >      .alloc_inode    = ovl_alloc_inode,
 >  >      .free_inode    = ovl_free_inode,
 >  >      .destroy_inode    = ovl_destroy_inode,
 >  > -    .drop_inode    = generic_delete_inode,
 >  > +    .drop_inode    = ovl_drop_inode,
 >  > +    .evict_inode    = ovl_evict_inode,
 >  >      .put_super    = ovl_put_super,
 >  >      .sync_fs    = ovl_sync_fs,
 >  >      .statfs        = ovl_statfs,
 >  > @@ -1601,6 +1580,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 >  >      if (!ofs)
 >  >          goto out;
 >  >  
 >  > +    spin_lock_init(&ofs->upper_inodes_lock);
 >  > +    INIT_LIST_HEAD(&ofs->upper_inodes);
 >  > +
 >  >      ofs->creator_cred = cred = prepare_creds();
 >  >      if (!cred)
 >  >          goto out_err;
 >  > diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
 >  > new file mode 100644
 >  > index 000000000000..aecd312ec851
 >  > --- /dev/null
 >  > +++ b/fs/overlayfs/sync.c
 >  > @@ -0,0 +1,331 @@
 >  > +// SPDX-License-Identifier: GPL-2.0-only
 >  > +/*
 >  > + * Copyright (C) 2020 All Rights Reserved.
 >  > + * Author Chengguang Xu <cgxu519@mykernel.net>
 >  > + */
 >  > +
 >  > +#include <linux/fs.h>
 >  > +#include <linux/xattr.h>
 >  > +#include <linux/mount.h>
 >  > +#include <linux/writeback.h>
 >  > +#include <linux/blkdev.h>
 >  > +#include "overlayfs.h"
 >  > +
 >  > +/**
 >  > + * upper_inodes list is used for organizing potential target of syncfs,
 >  > + * ovl inode which has upper inode will be added to this list while
 >  > + * initializing or updating and will be deleted from this list while
 >  > + * evicting.
 >  > + *
 >  > + * Introduce ovl_inode flag "OVL_EVICT_PENDING" to indicate the ovl inode
 >  > + * is in eviction process, syncfs(actually ovl_sync_inodes()) will wait on
 >  > + * evicting inode until the IO to be completed in evict_inode().
 >  > + *
 >  > + * inode state/ovl_inode flags cycle in syncfs context will be as below.
 >  > + * OVL_EVICT_PENDING (ovl_inode->flags) is only marked when inode having
 >  > + * upper inode.
 >  > + *
 >  > + * (allocation)
 >  > + *   |
 >  > + * I_NEW (inode->i_state)
 >  > + *   |
 >  > + * NONE  (inode->i_state)
 >  > + *   |
 >  > + * OVL_EVICT_PENDING (ovl_inode->flags)
 >  > + *   |
 >  > + * I_FREEING (inode->i_state) | OVL_EVICT_PENDING (ovl_inode->flags)
 >  > + *   |
 >  > + * I_FREEING (inode->i_state) | I_CLEAR (inode->i_state)
 >  > + *   |
 >  > + * (destruction)
 >  > + *
 >  > + *
 >  > + * There are five locks in in syncfs contexts:
 >  > + *
 >  > + * upper_sb->s_umount(semaphore)    : avoiding r/o to r/w or vice versa
 >  > + * sb->s_sync_lock(mutex)           : avoiding concorrent syncfs running
 >  > + * ofs->upper_inodes_lock(spinlock) : protecting upper_inodes list
 >  > + * sb->s_inode_wblist_lock(spinlock): protecting s_inodes_wb(sync waiting) list
 >  > + * inode->i_lock(spinlock)          : protecting inode fields
 >  > + *
 >  > + * Lock dependencies in syncfs contexts:
 >  > + *
 >  > + * upper_sb->s_umount
 >  > + *    sb->s_sync_lock
 >  > + *        ofs->upper_inodes_lock
 >  > + *            inode->i_lock
 >  > + *
 >  > + * upper_sb->s_umount
 >  > + *    sb->s_sync_lock
 >  > + *        sb->s_inode_wblist_lock
 >  > + *
 >  > + */
 >  > +
 >  > +struct ovl_write_inode_work {
 >  > +    struct work_struct work;
 >  > +    struct inode *inode;
 >  > +};
 >  > +
 >  > +static void ovl_write_inode_work_fn(struct work_struct *work)
 >  > +{
 >  > +    struct ovl_write_inode_work *ovl_wiw;
 >  > +    struct inode *inode;
 >  > +
 >  > +    ovl_wiw = container_of(work, struct ovl_write_inode_work, work);
 >  > +    inode = ovl_wiw->inode;
 >  > +    write_inode_now(ovl_inode_upper(inode), 1);
 >  > +
 >  > +    spin_lock(&inode->i_lock);
 >  > +    ovl_clear_flag(OVL_WRITE_INODE_PENDING, inode);
 >  > +    wake_up_bit(&OVL_I(inode)->flags, OVL_WRITE_INODE_PENDING);
 >  > +    spin_unlock(&inode->i_lock);
 >  > +}
 >  > +
 >  > +void ovl_evict_inode(struct inode *inode)
 >  > +{
 >  > +    struct ovl_inode *oi = OVL_I(inode);
 >  > +    struct ovl_write_inode_work ovl_wiw;
 >  > +    DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
 >  > +    wait_queue_head_t *wqh;
 >  > +
 >  > +    if (ovl_inode_upper(inode)) {
 >  > +        if (current->flags & PF_MEMALLOC) {
 >  > +            spin_lock(&inode->i_lock);
 >  > +            ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
 >  > +            wqh = bit_waitqueue(&oi->flags,
 >  > +                    OVL_WRITE_INODE_PENDING);
 >  > +            prepare_to_wait(wqh, &wait.wq_entry,
 >  > +                    TASK_UNINTERRUPTIBLE);
 >  > +            spin_unlock(&inode->i_lock);
 >  > +
 >  > +            ovl_wiw.inode = inode;
 >  > +            INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
 >  > +            schedule_work(&ovl_wiw.work);
 >  > +
 >  > +            schedule();
 >  > +            finish_wait(wqh, &wait.wq_entry);
 >  > +        } else {
 >  > +            write_inode_now(ovl_inode_upper(inode), 1);
 >  > +        }
 >  > +        ovl_detach_upper_inodes_list(inode);
 >  > +
 >  > +        /*
 >  > +         * ovl_sync_inodes() may wait until
 >  > +         * flag OVL_EVICT_PENDING to be cleared.
 >  > +         */
 >  > +        spin_lock(&inode->i_lock);
 >  > +        ovl_clear_flag(OVL_EVICT_PENDING, inode);
 >  > +        wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
 >  > +        spin_unlock(&inode->i_lock);
 >  > +    }
 >  > +    clear_inode(inode);
 >  > +}
 >  > +
 >  > +void ovl_wait_evict_pending(struct inode *inode)
 >  > +{
 >  > +    struct ovl_fs *ofs = inode->i_sb->s_fs_info;
 >  > +    struct ovl_inode *oi = OVL_I(inode);
 >  > +    DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
 >  > +    wait_queue_head_t *wqh;
 >  > +
 >  > +    wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
 >  > +    prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 >  > +    spin_unlock(&inode->i_lock);
 >  > +    spin_unlock(&ofs->upper_inodes_lock);
 >  > +    schedule();
 >  > +    finish_wait(wqh, &wait.wq_entry);
 >  > +}
 >  > +
 >  > +/**
 >  > + * ovl_sync_inodes
 >  > + * @sb: The overlayfs super block
 >  > + *
 >  > + * upper_inodes list is used for organizing ovl inode which has upper inode,
 >  > + * by iterating the list to looking for and syncing dirty upper inode.
 >  > + *
 >  > + * When starting syncing inode, we add the inode to wait list explicitly,
 >  > + * in order to save memory we reuse inode->i_wb_list and sb->s_inodes_wb,
 >  > + * so those fields have slightly differnt meanings in overlayfs.
 >  > + */
 >  > +static void ovl_sync_inodes(struct super_block *sb)
 >  > +{
 >  > +    struct ovl_fs *ofs = sb->s_fs_info;
 >  > +    struct ovl_inode *oi;
 >  > +    struct inode *inode;
 >  > +    struct inode *upper_inode;
 >  > +    struct blk_plug plug;
 >  > +    LIST_HEAD(sync_tmp_list);
 >  > +
 >  > +    struct writeback_control wbc_for_sync = {
 >  > +        .sync_mode        = WB_SYNC_ALL,
 >  > +        .for_sync        = 1,
 >  > +        .range_start        = 0,
 >  > +        .range_end        = LLONG_MAX,
 >  > +        .nr_to_write        = LONG_MAX,
 >  > +    };
 >  > +
 >  > +    blk_start_plug(&plug);
 >  > +    spin_lock(&ofs->upper_inodes_lock);
 >  > +    list_splice_init(&ofs->upper_inodes, &sync_tmp_list);
 >  > +
 >  > +    while (!list_empty(&sync_tmp_list)) {
 >  > +        oi = list_first_entry(&sync_tmp_list, struct ovl_inode,
 >  > +                        upper_inodes_list);
 >  > +        inode = &oi->vfs_inode;
 >  > +        spin_lock(&inode->i_lock);
 >  > +
 >  > +        if (inode->i_state & I_NEW) {
 >  > +            list_move_tail(&oi->upper_inodes_list,
 >  > +                    &ofs->upper_inodes);
 >  > +            spin_unlock(&inode->i_lock);
 >  > +            continue;
 >  > +        }
 >  > +
 >  > +        /*
 >  > +         * If ovl_inode flags is OVL_EVICT_PENDING,
 >  > +         * left sync operation to the ovl_evict_inode(),
 >  > +         * so wait here until OVL_EVICT_PENDING flag to be cleared.
 >  > +         */
 >  > +        if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
 >  > +            ovl_wait_evict_pending(inode);
 >  > +            goto next;
 >  > +        }
 >  > +
 >  > +        list_move_tail(&oi->upper_inodes_list,
 >  > +                &ofs->upper_inodes);
 >  > +        ihold(inode);
 >  > +        upper_inode = ovl_inode_upper(inode);
 >  > +        spin_unlock(&inode->i_lock);
 >  > +        spin_unlock(&ofs->upper_inodes_lock);
 >  > +
 >  > +        if (!(upper_inode->i_state & I_DIRTY_ALL)) {
 >  > +            if (!mapping_tagged(upper_inode->i_mapping,
 >  > +                        PAGECACHE_TAG_WRITEBACK)) {
 >  > +                iput(inode);
 >  > +                goto next;
 >  > +            }
 >  > +        } else {
 >  > +            sync_inode(upper_inode, &wbc_for_sync);
 >  > +        }
 >  > +
 >  > +        spin_lock(&sb->s_inode_wblist_lock);
 >  > +        list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
 >  > +        spin_unlock(&sb->s_inode_wblist_lock);
 >  > +
 >  > +next:
 >  > +        cond_resched();
 >  > +        spin_lock(&ofs->upper_inodes_lock);
 >  > +    }
 >  > +    spin_unlock(&ofs->upper_inodes_lock);
 >  > +    blk_finish_plug(&plug);
 >  > +}
 >  > +
 >  > +/**
 >  > + * ovl_wait_inodes
 >  > + * @sb: The overlayfs super block
 >  > + *
 >  > + * Waiting writeback inodes which are in s_inodes_wb list,
 >  > + * all the IO that has been issued up to the time this
 >  > + * function is enter is guaranteed to be completed.
 >  > + */
 >  > +static void ovl_wait_inodes(struct super_block *sb)
 >  > +{
 >  > +    LIST_HEAD(sync_wait_list);
 >  > +    struct inode *inode;
 >  > +    struct inode *upper_inode;
 >  > +
 >  > +    /*
 >  > +     * ovl inode in sb->s_inodes_wb list has already got inode reference,
 >  > +     * this will avoid silent inode droping during waiting on it.
 >  > +     *
 >  > +     * Splice the sync wait list onto a temporary list to avoid waiting on
 >  > +     * inodes that have started writeback after this point.
 >  > +     */
 >  > +    spin_lock(&sb->s_inode_wblist_lock);
 >  > +    list_splice_init(&sb->s_inodes_wb, &sync_wait_list);
 >  > +
 >  > +    while (!list_empty(&sync_wait_list)) {
 >  > +        inode = list_first_entry(&sync_wait_list, struct inode,
 >  > +                            i_wb_list);
 >  > +        list_del_init(&inode->i_wb_list);
 >  > +        upper_inode = ovl_inode_upper(inode);
 >  > +        spin_unlock(&sb->s_inode_wblist_lock);
 >  > +
 >  > +        if (!mapping_tagged(upper_inode->i_mapping,
 >  > +                PAGECACHE_TAG_WRITEBACK)) {
 >  > +            goto next;
 >  > +        }
 >  > +
 >  > +        filemap_fdatawait_keep_errors(upper_inode->i_mapping);
 >  > +        cond_resched();
 >  > +
 >  > +next:
 >  > +        iput(inode);
 >  > +        spin_lock(&sb->s_inode_wblist_lock);
 >  > +    }
 >  > +    spin_unlock(&sb->s_inode_wblist_lock);
 >  > +}
 >  > +
 >  > +/**
 >  > + * ovl_sync_filesystem
 >  > + * @sb: The overlayfs super block
 >  > + *
 >  > + * Sync underlying dirty inodes in upper filesystem and
 >  > + * wait for completion.
 >  > + */
 >  > +static int ovl_sync_filesystem(struct super_block *sb)
 >  > +{
 >  > +    struct ovl_fs *ofs = sb->s_fs_info;
 >  > +    struct super_block *upper_sb = ofs->upper_mnt->mnt_sb;
 >  > +    int ret = 0;
 >  > +
 >  > +    down_read(&upper_sb->s_umount);
 >  > +    if (!sb_rdonly(upper_sb)) {
 >  > +        /*
 >  > +         * s_sync_lock is used for serializing concurrent
 >  > +         * syncfs operations.
 >  > +         */
 >  > +        mutex_lock(&sb->s_sync_lock);
 >  > +        ovl_sync_inodes(sb);
 >  > +        /* Calling sync_fs with no-wait for better performance. */
 >  > +        if (upper_sb->s_op->sync_fs)
 >  > +            upper_sb->s_op->sync_fs(upper_sb, 0);
 >  > +
 >  > +        ovl_wait_inodes(sb);
 >  > +        if (upper_sb->s_op->sync_fs)
 >  > +            upper_sb->s_op->sync_fs(upper_sb, 1);
 >  > +
 >  > +        ret = sync_blockdev(upper_sb->s_bdev);
 >  > +        mutex_unlock(&sb->s_sync_lock);
 >  > +    }
 >  > +    up_read(&upper_sb->s_umount);
 >  > +    return ret;
 >  > +}
 >  > +
 >  > +/**
 >  > + * ovl_sync_fs
 >  > + * @sb: The overlayfs super block
 >  > + * @wait: Wait for I/O to complete
 >  > + *
 >  > + * Sync real dirty inodes in upper filesystem (if it exists)
 >  > + */
 >  > +int ovl_sync_fs(struct super_block *sb, int wait)
 >  > +{
 >  > +    struct ovl_fs *ofs = sb->s_fs_info;
 >  > +
 >  > +    if (!ofs->upper_mnt)
 >  > +        return 0;
 >  > +
 >  > +    /*
 >  > +     * If this is a sync(2) call or an emergency sync, all the super blocks
 >  > +     * will be iterated, including upper_sb, so no need to do anything.
 >  > +     *
 >  > +     * If this is a syncfs(2) call, then we do need to call
 >  > +     * sync_filesystem() on upper_sb, but it is enough to do it when
 >  > +     * called with wait == 1.
 >  > +     */
 >  > +    if (!wait)
 >  > +        return 0;
 >  > +
 >  > +    return ovl_sync_filesystem(sb);
 >  > +}
 >  > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
 >  > index ea005085803f..73ef195d9aef 100644
 >  > --- a/fs/overlayfs/util.c
 >  > +++ b/fs/overlayfs/util.c
 >  > @@ -386,13 +386,33 @@ void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect)
 >  >      oi->redirect = redirect;
 >  >  }
 >  >  
 >  > +void ovl_attach_upper_inodes_list(struct inode *inode)
 >  > +{
 >  > +    struct ovl_fs *ofs = inode->i_sb->s_fs_info;
 >  > +
 >  > +    spin_lock(&ofs->upper_inodes_lock);
 >  > +    list_add(&OVL_I(inode)->upper_inodes_list, &ofs->upper_inodes);
 >  > +    spin_unlock(&ofs->upper_inodes_lock);
 >  > +}
 >  > +
 >  > +void ovl_detach_upper_inodes_list(struct inode *inode)
 >  > +{
 >  > +    struct ovl_fs *ofs = inode->i_sb->s_fs_info;
 >  > +
 >  > +    spin_lock(&ofs->upper_inodes_lock);
 >  > +    list_del_init(&OVL_I(inode)->upper_inodes_list);
 >  > +    spin_unlock(&ofs->upper_inodes_lock);
 >  > +}
 >  > +
 >  >  void ovl_inode_init(struct inode *inode, struct dentry *upperdentry,
 >  >              struct dentry *lowerdentry, struct dentry *lowerdata)
 >  >  {
 >  >      struct inode *realinode = d_inode(upperdentry ?: lowerdentry);
 >  >  
 >  > -    if (upperdentry)
 >  > +    if (upperdentry) {
 >  >          OVL_I(inode)->__upperdentry = upperdentry;
 >  > +        ovl_attach_upper_inodes_list(inode);
 >  > +    }
 >  >      if (lowerdentry)
 >  >          OVL_I(inode)->lower = igrab(d_inode(lowerdentry));
 >  >      if (lowerdata)
 >  > @@ -421,6 +441,7 @@ void ovl_inode_update(struct inode *inode, struct dentry *upperdentry)
 >  >          inode->i_private = upperinode;
 >  >          __insert_inode_hash(inode, (unsigned long) upperinode);
 >  >      }
 >  > +    ovl_attach_upper_inodes_list(inode);
 >  >  }
 >  >  
 >  >  static void ovl_dentry_version_inc(struct dentry *dentry, bool impurity)
 >  > -- 
 >  > 2.21.1
 >  > 
 >  > 
 >  >
 >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-02-10  3:10 [PATCH v11] ovl: Improving syncfs efficiency Chengguang Xu
  2020-03-17  4:41 ` Re: [PATCH " Chengguang Xu
@ 2020-04-15 19:19 ` Miklos Szeredi
  2020-04-16  6:08   ` Chengguang Xu
  1 sibling, 1 reply; 14+ messages in thread
From: Miklos Szeredi @ 2020-04-15 19:19 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Amir Goldstein, Jan Kara, overlayfs

On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>
> Currently, a syncfs(2) syscall on overlayfs just calls sync_filesystem()
> on upper_sb to synchronize all dirty inodes in the upper filesystem,
> regardless of which overlay owns each inode. In the container use case,
> when multiple containers use the same underlying upper filesystem,
> this has the following shortcomings.
>
> (1) Performance
> Synchronization is potentially heavy because it also syncs inodes that
> are unnecessary for the target overlayfs.
>
> (2) Interference
> Unplanned synchronization will probably impact the IO performance of
> unrelated container processes on the other overlayfs instances.
>
> This patch iterates the upper inode list in overlayfs to sync only the
> target dirty inodes and wait for completion. This reduces the cost of
> synchronization and does not seriously impact the IO performance of
> unrelated processes. In the special case where an overlayfs has very few
> dirty inodes and a lot of clean upper inodes, the iteration may waste
> time and make synchronization slower than before, but we think the
> potential for performance improvement far surpasses the potential for
> performance regression in most cases.

A possible optimization would be to only add inodes to the list that
may have been modified (e.g. opened for write, truncated, etc..).
This might not be a big win due to the fact that upper inodes are
likely those that have been recently copied up.  But as I see it, it
does not have any drawbacks compared to adding the inode to the
list on lookup.




>
> Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> ---
> Changes since v10:
> - Add special handling in ovl_evict_inode() for inode eviction triggered
> by kswapd memory reclamation (with PF_MEMALLOC flag).
> - Rebase on current latest overlayfs-next tree.
> - Slightly modify license information.
>
> Changes since v9:
> - Rebase on current latest overlayfs-next tree.
> - Calling clear_inode() regardless of having upper inode.
>
> Changes since v8:
> - Remove unnecessary blk_start_plug() call in ovl_sync_inodes().
>
> Changes since v7:
> - Check OVL_EVICT_PENDING flag instead of I_FREEING to recognize
> inode which is in eviction process.
> - Do not move ovl_inode back to ofs->upper_inodes when inode is in
> eviction process.
> - Delete unnecessary memory barrier in ovl_evict_inode().
>
> Changes since v6:
> - Take/release sb->s_sync_lock before/after sync_fs to serialize
> concurrent syncfs.
> - Change list iterating method to improve code readability.
> - Fix typo in commit log and comments.
> - Add header comment to sync.c.
>
> Changes since v5:
> - Move sync related functions to new file sync.c.
> - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
> - If ovl inode flag is I_WILL_FREE or I_FREEING then syncfs
> will wait until OVL_EVICT_PENDING to be cleared.
> - Move inode sync operation into evict_inode from drop_inode.
> - Call upper_sb->sync_fs with no-wait after sync dirty upper
> inodes.
> - Hold s_inode_wblist_lock until deleting item from list.
> - Remove write_inode function because it is no longer used.
>
> Changes since v4:
> - Add syncing operation and deleting from upper_inodes list
> during ovl inode destruction.
> - Export symbol of inode_wait_for_writeback() for waiting
> writeback on ovl inode.
> - Reuse I_SYNC flag to avoid race conditions between syncfs
> and drop_inode.
>
> Changes since v3:
> - Introduce upper_inode list to organize ovl inode which has upper
>   inode.
> - Avoid iput operation inside spin lock.
> - Change list iteration method for safety.
> - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
> - Modify commit log and add more comments to the code.
>
> Changes since v2:
> - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
> overlayfs.
> - Factoring out sync operation to different helpers for easy
>   understanding.
>
> Changes since v1:
> - If upper filesystem is readonly mode then skip synchronization.
> - Introduce global wait list to replace temporary wait list for
> concurrent synchronization.
>
>  fs/overlayfs/Makefile    |   2 +-
>  fs/overlayfs/overlayfs.h |   9 ++
>  fs/overlayfs/ovl_entry.h |   5 +
>  fs/overlayfs/super.c     |  44 ++----
>  fs/overlayfs/sync.c      | 331 +++++++++++++++++++++++++++++++++++++++
>  fs/overlayfs/util.c      |  23 ++-
>  6 files changed, 381 insertions(+), 33 deletions(-)
>  create mode 100644 fs/overlayfs/sync.c
>
> diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
> index 9164c585eb2f..0c4e53fcf4ef 100644
> --- a/fs/overlayfs/Makefile
> +++ b/fs/overlayfs/Makefile
> @@ -6,4 +6,4 @@
>  obj-$(CONFIG_OVERLAY_FS) += overlay.o
>
>  overlay-objs := super.o namei.o util.o inode.o file.o dir.o readdir.o \
> -               copy_up.o export.o
> +               copy_up.o export.o sync.o
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 3623d28aa4fa..3cdb06e206a3 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -40,6 +40,10 @@ enum ovl_inode_flag {
>         OVL_UPPERDATA,
>         /* Inode number will remain constant over copy up. */
>         OVL_CONST_INO,
> +       /* Set when ovl inode is about to be freed */
> +       OVL_EVICT_PENDING,
> +       /* Set while the upper inode is being synced by a workqueue item */
> +       OVL_WRITE_INODE_PENDING,
>  };
>
>  enum ovl_entry_flag {
> @@ -295,6 +299,7 @@ bool ovl_is_metacopy_dentry(struct dentry *dentry);
>  char *ovl_get_redirect_xattr(struct dentry *dentry, int padding);
>  ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
>                      size_t padding);
> +void ovl_detach_upper_inodes_list(struct inode *inode);
>
>  static inline bool ovl_is_impuredir(struct dentry *dentry)
>  {
> @@ -466,3 +471,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
>
>  /* export.c */
>  extern const struct export_operations ovl_export_operations;
> +
> +/* sync.c */
> +void ovl_evict_inode(struct inode *inode);
> +int ovl_sync_fs(struct super_block *sb, int wait);
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index 89015ea822e7..42e092e8cd61 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -75,6 +75,9 @@ struct ovl_fs {
>         struct inode *indexdir_trap;
>         /* -1: disabled, 0: same fs, 1..32: number of unused ino bits */
>         int xino_mode;
> +       /* Upper inode list and lock */
> +       spinlock_t upper_inodes_lock;
> +       struct list_head upper_inodes;
>  };
>
>  static inline struct ovl_fs *OVL_FS(struct super_block *sb)
> @@ -115,6 +118,8 @@ struct ovl_inode {
>
>         /* synchronize copy up and more */
>         struct mutex lock;
> +       /* Upper inodes list */
> +       struct list_head upper_inodes_list;
>  };
>
>  static inline struct ovl_inode *OVL_I(struct inode *inode)
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 319fe0d355b0..3e6871e1a7ba 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -183,6 +183,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
>         oi->lower = NULL;
>         oi->lowerdata = NULL;
>         mutex_init(&oi->lock);
> +       INIT_LIST_HEAD(&oi->upper_inodes_list);
>
>         return &oi->vfs_inode;
>  }
> @@ -249,36 +250,6 @@ static void ovl_put_super(struct super_block *sb)
>         ovl_free_fs(ofs);
>  }
>
> -/* Sync real dirty inodes in upper filesystem (if it exists) */
> -static int ovl_sync_fs(struct super_block *sb, int wait)
> -{
> -       struct ovl_fs *ofs = sb->s_fs_info;
> -       struct super_block *upper_sb;
> -       int ret;
> -
> -       if (!ofs->upper_mnt)
> -               return 0;
> -
> -       /*
> -        * If this is a sync(2) call or an emergency sync, all the super blocks
> -        * will be iterated, including upper_sb, so no need to do anything.
> -        *
> -        * If this is a syncfs(2) call, then we do need to call
> -        * sync_filesystem() on upper_sb, but enough if we do it when being
> -        * called with wait == 1.
> -        */
> -       if (!wait)
> -               return 0;
> -
> -       upper_sb = ofs->upper_mnt->mnt_sb;
> -
> -       down_read(&upper_sb->s_umount);
> -       ret = sync_filesystem(upper_sb);
> -       up_read(&upper_sb->s_umount);
> -
> -       return ret;
> -}
> -
>  /**
>   * ovl_statfs
>   * @sb: The overlayfs super block
> @@ -376,11 +347,19 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>         return 0;
>  }
>
> +static int ovl_drop_inode(struct inode *inode)
> +{
> +       if (ovl_inode_upper(inode))
> +               ovl_set_flag(OVL_EVICT_PENDING, inode);
> +       return 1;
> +}
> +
>  static const struct super_operations ovl_super_operations = {
>         .alloc_inode    = ovl_alloc_inode,
>         .free_inode     = ovl_free_inode,
>         .destroy_inode  = ovl_destroy_inode,
> -       .drop_inode     = generic_delete_inode,
> +       .drop_inode     = ovl_drop_inode,
> +       .evict_inode    = ovl_evict_inode,
>         .put_super      = ovl_put_super,
>         .sync_fs        = ovl_sync_fs,
>         .statfs         = ovl_statfs,
> @@ -1601,6 +1580,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>         if (!ofs)
>                 goto out;
>
> +       spin_lock_init(&ofs->upper_inodes_lock);
> +       INIT_LIST_HEAD(&ofs->upper_inodes);
> +
>         ofs->creator_cred = cred = prepare_creds();
>         if (!cred)
>                 goto out_err;
> diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
> new file mode 100644
> index 000000000000..aecd312ec851
> --- /dev/null
> +++ b/fs/overlayfs/sync.c
> @@ -0,0 +1,331 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 All Rights Reserved.
> + * Author Chengguang Xu <cgxu519@mykernel.net>
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/xattr.h>
> +#include <linux/mount.h>
> +#include <linux/writeback.h>
> +#include <linux/blkdev.h>
> +#include "overlayfs.h"
> +
> +/**
> + * The upper_inodes list organizes the potential targets of syncfs:
> + * an ovl inode which has an upper inode is added to this list when it is
> + * initialized or updated, and is deleted from this list when it is
> + * evicted.
> + *
> + * Introduce the ovl_inode flag "OVL_EVICT_PENDING" to indicate that the
> + * ovl inode is in the eviction process; syncfs (actually ovl_sync_inodes())
> + * will wait on an evicting inode until its IO has completed in
> + * evict_inode().
> + *
> + * inode state/ovl_inode flags cycle in syncfs context will be as below.
> + * OVL_EVICT_PENDING (ovl_inode->flags) is only marked when inode having
> + * upper inode.
> + *
> + * (allocation)
> + *   |
> + * I_NEW (inode->i_state)
> + *   |
> + * NONE  (inode->i_state)
> + *   |
> + * OVL_EVICT_PENDING (ovl_inode->flags)
> + *   |
> + * I_FREEING (inode->i_state) | OVL_EVICT_PENDING (ovl_inode->flags)
> + *   |
> + * I_FREEING (inode->i_state) | I_CLEAR (inode->i_state)
> + *   |
> + * (destruction)
> + *
> + *
> + * There are five locks in syncfs contexts:
> + *
> + * upper_sb->s_umount(semaphore)    : avoiding r/o to r/w or vice versa
> + * sb->s_sync_lock(mutex)           : avoiding concurrent syncfs running
> + * ofs->upper_inodes_lock(spinlock) : protecting upper_inodes list
> + * sb->s_inode_wblist_lock(spinlock): protecting s_inodes_wb(sync waiting) list
> + * inode->i_lock(spinlock)          : protecting inode fields
> + *
> + * Lock dependencies in syncfs contexts:
> + *
> + * upper_sb->s_umount
> + *     sb->s_sync_lock
> + *             ofs->upper_inodes_lock
> + *                     inode->i_lock
> + *
> + * upper_sb->s_umount
> + *     sb->s_sync_lock
> + *             sb->s_inode_wblist_lock
> + *
> + */
> +
> +struct ovl_write_inode_work {
> +       struct work_struct work;
> +       struct inode *inode;
> +};
> +
> +static void ovl_write_inode_work_fn(struct work_struct *work)
> +{
> +       struct ovl_write_inode_work *ovl_wiw;
> +       struct inode *inode;
> +
> +       ovl_wiw = container_of(work, struct ovl_write_inode_work, work);
> +       inode = ovl_wiw->inode;
> +       write_inode_now(ovl_inode_upper(inode), 1);
> +
> +       spin_lock(&inode->i_lock);
> +       ovl_clear_flag(OVL_WRITE_INODE_PENDING, inode);
> +       wake_up_bit(&OVL_I(inode)->flags, OVL_WRITE_INODE_PENDING);
> +       spin_unlock(&inode->i_lock);
> +}
> +
> +void ovl_evict_inode(struct inode *inode)
> +{
> +       struct ovl_inode *oi = OVL_I(inode);
> +       struct ovl_write_inode_work ovl_wiw;
> +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
> +       wait_queue_head_t *wqh;
> +
> +       if (ovl_inode_upper(inode)) {
> +               if (current->flags & PF_MEMALLOC) {
> +                       spin_lock(&inode->i_lock);
> +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
> +                       wqh = bit_waitqueue(&oi->flags,
> +                                       OVL_WRITE_INODE_PENDING);
> +                       prepare_to_wait(wqh, &wait.wq_entry,
> +                                       TASK_UNINTERRUPTIBLE);
> +                       spin_unlock(&inode->i_lock);
> +
> +                       ovl_wiw.inode = inode;
> +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
> +                       schedule_work(&ovl_wiw.work);
> +
> +                       schedule();
> +                       finish_wait(wqh, &wait.wq_entry);

What is the reason to do this in another thread if this is a PF_MEMALLOC task?

> +               } else {
> +                       write_inode_now(ovl_inode_upper(inode), 1);
> +               }
> +               ovl_detach_upper_inodes_list(inode);
> +
> +               /*
> +                * ovl_sync_inodes() may wait until the
> +                * OVL_EVICT_PENDING flag is cleared.
> +                */
> +               spin_lock(&inode->i_lock);
> +               ovl_clear_flag(OVL_EVICT_PENDING, inode);
> +               wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
> +               spin_unlock(&inode->i_lock);
> +       }
> +       clear_inode(inode);
> +}
> +
> +void ovl_wait_evict_pending(struct inode *inode)
> +{
> +       struct ovl_fs *ofs = inode->i_sb->s_fs_info;
> +       struct ovl_inode *oi = OVL_I(inode);
> +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
> +       wait_queue_head_t *wqh;
> +
> +       wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
> +       prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> +       spin_unlock(&inode->i_lock);
> +       spin_unlock(&ofs->upper_inodes_lock);
> +       schedule();
> +       finish_wait(wqh, &wait.wq_entry);
> +}
> +
> +/**
> + * ovl_sync_inodes
> + * @sb: The overlayfs super block
> + *
> + * The upper_inodes list organizes the ovl inodes that have an upper inode;
> + * we iterate the list looking for dirty upper inodes and sync them.
> + *
> + * When we start syncing an inode, we add it to the wait list explicitly.
> + * To save memory we reuse inode->i_wb_list and sb->s_inodes_wb, so those
> + * fields have slightly different meanings in overlayfs.
> + */
> +static void ovl_sync_inodes(struct super_block *sb)
> +{
> +       struct ovl_fs *ofs = sb->s_fs_info;
> +       struct ovl_inode *oi;
> +       struct inode *inode;
> +       struct inode *upper_inode;
> +       struct blk_plug plug;
> +       LIST_HEAD(sync_tmp_list);
> +
> +       struct writeback_control wbc_for_sync = {
> +               .sync_mode              = WB_SYNC_ALL,
> +               .for_sync               = 1,
> +               .range_start            = 0,
> +               .range_end              = LLONG_MAX,
> +               .nr_to_write            = LONG_MAX,
> +       };
> +
> +       blk_start_plug(&plug);
> +       spin_lock(&ofs->upper_inodes_lock);
> +       list_splice_init(&ofs->upper_inodes, &sync_tmp_list);
> +
> +       while (!list_empty(&sync_tmp_list)) {
> +               oi = list_first_entry(&sync_tmp_list, struct ovl_inode,
> +                                               upper_inodes_list);
> +               inode = &oi->vfs_inode;
> +               spin_lock(&inode->i_lock);
> +
> +               if (inode->i_state & I_NEW) {
> +                       list_move_tail(&oi->upper_inodes_list,
> +                                       &ofs->upper_inodes);
> +                       spin_unlock(&inode->i_lock);
> +                       continue;
> +               }
> +
> +               /*
> +                * If the OVL_EVICT_PENDING flag is set, the sync
> +                * operation is left to ovl_evict_inode(), so wait
> +                * here until the flag is cleared.
> +                */
> +               if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
> +                       ovl_wait_evict_pending(inode);
> +                       goto next;

Does this skip re-adding to upper_inodes_list?

> +               }
> +
> +               list_move_tail(&oi->upper_inodes_list,
> +                               &ofs->upper_inodes);
> +               ihold(inode);
> +               upper_inode = ovl_inode_upper(inode);
> +               spin_unlock(&inode->i_lock);
> +               spin_unlock(&ofs->upper_inodes_lock);
> +
> +               if (!(upper_inode->i_state & I_DIRTY_ALL)) {
> +                       if (!mapping_tagged(upper_inode->i_mapping,
> +                                               PAGECACHE_TAG_WRITEBACK)) {
> +                               iput(inode);
> +                               goto next;
> +                       }
> +               } else {
> +                       sync_inode(upper_inode, &wbc_for_sync);
> +               }
> +
> +               spin_lock(&sb->s_inode_wblist_lock);
> +               list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> +               spin_unlock(&sb->s_inode_wblist_lock);
> +
> +next:
> +               cond_resched();
> +               spin_lock(&ofs->upper_inodes_lock);
> +       }
> +       spin_unlock(&ofs->upper_inodes_lock);
> +       blk_finish_plug(&plug);
> +}
> +
> +/**
> + * ovl_wait_inodes
> + * @sb: The overlayfs super block
> + *
> + * Wait for writeback on the inodes in the s_inodes_wb list;
> + * all the IO that has been issued up to the time this
> + * function is entered is guaranteed to have completed.
> + */
> +static void ovl_wait_inodes(struct super_block *sb)
> +{
> +       LIST_HEAD(sync_wait_list);
> +       struct inode *inode;
> +       struct inode *upper_inode;
> +
> +       /*
> +        * An ovl inode on the sb->s_inodes_wb list already holds an inode
> +        * reference, which avoids the inode being silently dropped while
> +        * waiting on it.
> +        *
> +        * Splice the sync wait list onto a temporary list to avoid waiting on
> +        * inodes that have started writeback after this point.
> +        */
> +       spin_lock(&sb->s_inode_wblist_lock);
> +       list_splice_init(&sb->s_inodes_wb, &sync_wait_list);
> +
> +       while (!list_empty(&sync_wait_list)) {
> +               inode = list_first_entry(&sync_wait_list, struct inode,
> +                                                       i_wb_list);
> +               list_del_init(&inode->i_wb_list);
> +               upper_inode = ovl_inode_upper(inode);
> +               spin_unlock(&sb->s_inode_wblist_lock);
> +
> +               if (!mapping_tagged(upper_inode->i_mapping,
> +                               PAGECACHE_TAG_WRITEBACK)) {
> +                       goto next;
> +               }
> +
> +               filemap_fdatawait_keep_errors(upper_inode->i_mapping);
> +               cond_resched();
> +
> +next:
> +               iput(inode);
> +               spin_lock(&sb->s_inode_wblist_lock);
> +       }
> +       spin_unlock(&sb->s_inode_wblist_lock);
> +}
> +
> +/**
> + * ovl_sync_filesystem
> + * @sb: The overlayfs super block
> + *
> + * Sync underlying dirty inodes in upper filesystem and
> + * wait for completion.
> + */
> +static int ovl_sync_filesystem(struct super_block *sb)
> +{
> +       struct ovl_fs *ofs = sb->s_fs_info;
> +       struct super_block *upper_sb = ofs->upper_mnt->mnt_sb;
> +       int ret = 0;
> +
> +       down_read(&upper_sb->s_umount);
> +       if (!sb_rdonly(upper_sb)) {
> +               /*
> +                * s_sync_lock is used for serializing concurrent
> +                * syncfs operations.
> +                */
> +               mutex_lock(&sb->s_sync_lock);
> +               ovl_sync_inodes(sb);
> +               /* Calling sync_fs with no-wait for better performance. */
> +               if (upper_sb->s_op->sync_fs)
> +                       upper_sb->s_op->sync_fs(upper_sb, 0);
> +
> +               ovl_wait_inodes(sb);
> +               if (upper_sb->s_op->sync_fs)
> +                       upper_sb->s_op->sync_fs(upper_sb, 1);
> +
> +               ret = sync_blockdev(upper_sb->s_bdev);
> +               mutex_unlock(&sb->s_sync_lock);
> +       }
> +       up_read(&upper_sb->s_umount);
> +       return ret;
> +}
> +
> +/**
> + * ovl_sync_fs
> + * @sb: The overlayfs super block
> + * @wait: Wait for I/O to complete
> + *
> + * Sync real dirty inodes in upper filesystem (if it exists)
> + */
> +int ovl_sync_fs(struct super_block *sb, int wait)
> +{
> +       struct ovl_fs *ofs = sb->s_fs_info;
> +
> +       if (!ofs->upper_mnt)
> +               return 0;
> +
> +       /*
> +        * If this is a sync(2) call or an emergency sync, all the super blocks
> +        * will be iterated, including upper_sb, so no need to do anything.
> +        *
> +        * If this is a syncfs(2) call, then we do need to call
> +        * sync_filesystem() on upper_sb, but it is enough to do it when
> +        * called with wait == 1.
> +        */
> +       if (!wait)
> +               return 0;
> +
> +       return ovl_sync_filesystem(sb);
> +}
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index ea005085803f..73ef195d9aef 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -386,13 +386,33 @@ void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect)
>         oi->redirect = redirect;
>  }
>
> +void ovl_attach_upper_inodes_list(struct inode *inode)
> +{
> +       struct ovl_fs *ofs = inode->i_sb->s_fs_info;
> +
> +       spin_lock(&ofs->upper_inodes_lock);
> +       list_add(&OVL_I(inode)->upper_inodes_list, &ofs->upper_inodes);
> +       spin_unlock(&ofs->upper_inodes_lock);
> +}
> +
> +void ovl_detach_upper_inodes_list(struct inode *inode)
> +{
> +       struct ovl_fs *ofs = inode->i_sb->s_fs_info;
> +
> +       spin_lock(&ofs->upper_inodes_lock);
> +       list_del_init(&OVL_I(inode)->upper_inodes_list);
> +       spin_unlock(&ofs->upper_inodes_lock);
> +}
> +
>  void ovl_inode_init(struct inode *inode, struct dentry *upperdentry,
>                     struct dentry *lowerdentry, struct dentry *lowerdata)
>  {
>         struct inode *realinode = d_inode(upperdentry ?: lowerdentry);
>
> -       if (upperdentry)
> +       if (upperdentry) {
>                 OVL_I(inode)->__upperdentry = upperdentry;
> +               ovl_attach_upper_inodes_list(inode);
> +       }
>         if (lowerdentry)
>                 OVL_I(inode)->lower = igrab(d_inode(lowerdentry));
>         if (lowerdata)
> @@ -421,6 +441,7 @@ void ovl_inode_update(struct inode *inode, struct dentry *upperdentry)
>                 inode->i_private = upperinode;
>                 __insert_inode_hash(inode, (unsigned long) upperinode);
>         }
> +       ovl_attach_upper_inodes_list(inode);
>  }
>
>  static void ovl_dentry_version_inc(struct dentry *dentry, bool impurity)
> --
> 2.21.1
>
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-15 19:19 ` [PATCH " Miklos Szeredi
@ 2020-04-16  6:08   ` Chengguang Xu
  2020-04-16  7:21     ` Miklos Szeredi
  2020-04-16 11:14     ` Jan Kara
  0 siblings, 2 replies; 14+ messages in thread
From: Chengguang Xu @ 2020-04-16  6:08 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Amir Goldstein, Jan Kara, overlayfs

 ---- On Thu, 2020-04-16 03:19:50, Miklos Szeredi <miklos@szeredi.hu> wrote ----
 > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
 > >
 > > Currently, a syncfs(2) syscall on overlayfs just calls sync_filesystem()
 > > on upper_sb to synchronize all dirty inodes in the upper filesystem,
 > > regardless of which overlay owns each inode. In the container use case,
 > > when multiple containers use the same underlying upper filesystem,
 > > this has the following shortcomings.
 > >
 > > (1) Performance
 > > Synchronization is potentially heavy because it also syncs inodes that
 > > are unnecessary for the target overlayfs.
 > >
 > > (2) Interference
 > > Unplanned synchronization will probably impact the IO performance of
 > > unrelated container processes on the other overlayfs instances.
 > >
 > > This patch iterates the upper inode list in overlayfs to sync only the
 > > target dirty inodes and wait for completion. This reduces the cost of
 > > synchronization and does not seriously impact the IO performance of
 > > unrelated processes. In the special case where an overlayfs has very few
 > > dirty inodes and a lot of clean upper inodes, the iteration may waste
 > > time and make synchronization slower than before, but we think the
 > > potential for performance improvement far surpasses the potential for
 > > performance regression in most cases.
 > 
 > A possible optimization would be to only add inodes to the list that
 > may have been modified (e.g. opened for write, truncated, etc..).
 > This might not be a big win due to the fact that upper inodes are
 > likely those that have been recently copied up.  But as I see it, it
 > does not have any drawbacks compared to adding the inode to the
 > list on lookup.
 > 

Yes, I agree with you, I'll fix it in next version.

 > 
 > >
 > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > > ---
 > > Changes since v10:
 > > - Add special handling in ovl_evict_inode() for inode eviction triggered
 > > by kswapd memory reclamation (with PF_MEMALLOC flag).
 > > - Rebase on current latest overlayfs-next tree.
 > > - Slightly modify license information.
 > >
 > > Changes since v9:
 > > - Rebase on current latest overlayfs-next tree.
 > > - Calling clear_inode() regardless of having upper inode.
 > >
 > > Changes since v8:
 > > - Remove unnecessary blk_start_plug() call in ovl_sync_inodes().
 > >
 > > Changes since v7:
 > > - Check OVL_EVICT_PENDING flag instead of I_FREEING to recognize
 > > inode which is in eviction process.
 > > - Do not move ovl_inode back to ofs->upper_inodes when inode is in
 > > eviction process.
 > > - Delete unnecessary memory barrier in ovl_evict_inode().
 > >
 > > Changes since v6:
 > > - Take/release sb->s_sync_lock before/after sync_fs to serialize
 > > concurrent syncfs.
 > > - Change list iterating method to improve code readability.
 > > - Fix typo in commit log and comments.
 > > - Add header comment to sync.c.
 > >
 > > Changes since v5:
 > > - Move sync related functions to new file sync.c.
 > > - Introduce OVL_EVICT_PENDING flag to avoid race conditions.
 > > - If ovl inode flag is I_WILL_FREE or I_FREEING then syncfs
 > > will wait until OVL_EVICT_PENDING to be cleared.
 > > - Move inode sync operation into evict_inode from drop_inode.
 > > - Call upper_sb->sync_fs with no-wait after sync dirty upper
 > > inodes.
 > > - Hold s_inode_wblist_lock until deleting item from list.
 > > - Remove write_inode function because it is no longer used.
 > >
 > > Changes since v4:
 > > - Add syncing operation and deleting from upper_inodes list
 > > during ovl inode destruction.
 > > - Export symbol of inode_wait_for_writeback() for waiting
 > > writeback on ovl inode.
 > > - Reuse I_SYNC flag to avoid race conditions between syncfs
 > > and drop_inode.
 > >
 > > Changes since v3:
 > > - Introduce upper_inode list to organize ovl inode which has upper
 > >   inode.
 > > - Avoid iput operation inside spin lock.
 > > - Change list iteration method for safety.
 > > - Reuse inode->i_wb_list and sb->s_inodes_wb to save memory.
 > > - Modify commit log and add more comments to the code.
 > >
 > > Changes since v2:
 > > - Decoupling with syncfs of upper fs by taking sb->s_sync_lock of
 > > overlayfs.
 > > - Factoring out sync operation to different helpers for easy
 > >   understanding.
 > >
 > > Changes since v1:
 > > - If upper filesystem is readonly mode then skip synchronization.
 > > - Introduce global wait list to replace temporary wait list for
 > > concurrent synchronization.
 > >
 > >  fs/overlayfs/Makefile    |   2 +-
 > >  fs/overlayfs/overlayfs.h |   9 ++
 > >  fs/overlayfs/ovl_entry.h |   5 +
 > >  fs/overlayfs/super.c     |  44 ++----
 > >  fs/overlayfs/sync.c      | 331 +++++++++++++++++++++++++++++++++++++++
 > >  fs/overlayfs/util.c      |  23 ++-
 > >  6 files changed, 381 insertions(+), 33 deletions(-)
 > >  create mode 100644 fs/overlayfs/sync.c
 > >
 > > diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
 > > index 9164c585eb2f..0c4e53fcf4ef 100644
 > > --- a/fs/overlayfs/Makefile
 > > +++ b/fs/overlayfs/Makefile
 > > @@ -6,4 +6,4 @@
 > >  obj-$(CONFIG_OVERLAY_FS) += overlay.o
 > >
 > >  overlay-objs := super.o namei.o util.o inode.o file.o dir.o readdir.o \
 > > -               copy_up.o export.o
 > > +               copy_up.o export.o sync.o
 > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
 > > index 3623d28aa4fa..3cdb06e206a3 100644
 > > --- a/fs/overlayfs/overlayfs.h
 > > +++ b/fs/overlayfs/overlayfs.h
 > > @@ -40,6 +40,10 @@ enum ovl_inode_flag {
 > >         OVL_UPPERDATA,
 > >         /* Inode number will remain constant over copy up. */
 > >         OVL_CONST_INO,
 > > +       /* Set when ovl inode is about to be freed */
 > > +       OVL_EVICT_PENDING,
 > > +       /* Set when syncing the upper inode in workqueue work */
 > > +       OVL_WRITE_INODE_PENDING,
 > >  };
 > >
 > >  enum ovl_entry_flag {
 > > @@ -295,6 +299,7 @@ bool ovl_is_metacopy_dentry(struct dentry *dentry);
 > >  char *ovl_get_redirect_xattr(struct dentry *dentry, int padding);
 > >  ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 > >                      size_t padding);
 > > +void ovl_detach_upper_inodes_list(struct inode *inode);
 > >
 > >  static inline bool ovl_is_impuredir(struct dentry *dentry)
 > >  {
 > > @@ -466,3 +471,7 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 > >
 > >  /* export.c */
 > >  extern const struct export_operations ovl_export_operations;
 > > +
 > > +/* sync.c */
 > > +void ovl_evict_inode(struct inode *inode);
 > > +int ovl_sync_fs(struct super_block *sb, int wait);
 > > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
 > > index 89015ea822e7..42e092e8cd61 100644
 > > --- a/fs/overlayfs/ovl_entry.h
 > > +++ b/fs/overlayfs/ovl_entry.h
 > > @@ -75,6 +75,9 @@ struct ovl_fs {
 > >         struct inode *indexdir_trap;
 > >         /* -1: disabled, 0: same fs, 1..32: number of unused ino bits */
 > >         int xino_mode;
 > > +       /* Upper inode list and lock */
 > > +       spinlock_t upper_inodes_lock;
 > > +       struct list_head upper_inodes;
 > >  };
 > >
 > >  static inline struct ovl_fs *OVL_FS(struct super_block *sb)
 > > @@ -115,6 +118,8 @@ struct ovl_inode {
 > >
 > >         /* synchronize copy up and more */
 > >         struct mutex lock;
 > > +       /* Upper inodes list */
 > > +       struct list_head upper_inodes_list;
 > >  };
 > >
 > >  static inline struct ovl_inode *OVL_I(struct inode *inode)
 > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
 > > index 319fe0d355b0..3e6871e1a7ba 100644
 > > --- a/fs/overlayfs/super.c
 > > +++ b/fs/overlayfs/super.c
 > > @@ -183,6 +183,7 @@ static struct inode *ovl_alloc_inode(struct super_block *sb)
 > >         oi->lower = NULL;
 > >         oi->lowerdata = NULL;
 > >         mutex_init(&oi->lock);
 > > +       INIT_LIST_HEAD(&oi->upper_inodes_list);
 > >
 > >         return &oi->vfs_inode;
 > >  }
 > > @@ -249,36 +250,6 @@ static void ovl_put_super(struct super_block *sb)
 > >         ovl_free_fs(ofs);
 > >  }
 > >
 > > -/* Sync real dirty inodes in upper filesystem (if it exists) */
 > > -static int ovl_sync_fs(struct super_block *sb, int wait)
 > > -{
 > > -       struct ovl_fs *ofs = sb->s_fs_info;
 > > -       struct super_block *upper_sb;
 > > -       int ret;
 > > -
 > > -       if (!ofs->upper_mnt)
 > > -               return 0;
 > > -
 > > -       /*
 > > -        * If this is a sync(2) call or an emergency sync, all the super blocks
 > > -        * will be iterated, including upper_sb, so no need to do anything.
 > > -        *
 > > -        * If this is a syncfs(2) call, then we do need to call
 > > -        * sync_filesystem() on upper_sb, but enough if we do it when being
 > > -        * called with wait == 1.
 > > -        */
 > > -       if (!wait)
 > > -               return 0;
 > > -
 > > -       upper_sb = ofs->upper_mnt->mnt_sb;
 > > -
 > > -       down_read(&upper_sb->s_umount);
 > > -       ret = sync_filesystem(upper_sb);
 > > -       up_read(&upper_sb->s_umount);
 > > -
 > > -       return ret;
 > > -}
 > > -
 > >  /**
 > >   * ovl_statfs
 > >   * @sb: The overlayfs super block
 > > @@ -376,11 +347,19 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
 > >         return 0;
 > >  }
 > >
 > > +static int ovl_drop_inode(struct inode *inode)
 > > +{
 > > +       if (ovl_inode_upper(inode))
 > > +               ovl_set_flag(OVL_EVICT_PENDING, inode);
 > > +       return 1;
 > > +}
 > > +
 > >  static const struct super_operations ovl_super_operations = {
 > >         .alloc_inode    = ovl_alloc_inode,
 > >         .free_inode     = ovl_free_inode,
 > >         .destroy_inode  = ovl_destroy_inode,
 > > -       .drop_inode     = generic_delete_inode,
 > > +       .drop_inode     = ovl_drop_inode,
 > > +       .evict_inode    = ovl_evict_inode,
 > >         .put_super      = ovl_put_super,
 > >         .sync_fs        = ovl_sync_fs,
 > >         .statfs         = ovl_statfs,
 > > @@ -1601,6 +1580,9 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 > >         if (!ofs)
 > >                 goto out;
 > >
 > > +       spin_lock_init(&ofs->upper_inodes_lock);
 > > +       INIT_LIST_HEAD(&ofs->upper_inodes);
 > > +
 > >         ofs->creator_cred = cred = prepare_creds();
 > >         if (!cred)
 > >                 goto out_err;
 > > diff --git a/fs/overlayfs/sync.c b/fs/overlayfs/sync.c
 > > new file mode 100644
 > > index 000000000000..aecd312ec851
 > > --- /dev/null
 > > +++ b/fs/overlayfs/sync.c
 > > @@ -0,0 +1,331 @@
 > > +// SPDX-License-Identifier: GPL-2.0-only
 > > +/*
 > > + * Copyright (C) 2020 All Rights Reserved.
 > > + * Author Chengguang Xu <cgxu519@mykernel.net>
 > > + */
 > > +
 > > +#include <linux/fs.h>
 > > +#include <linux/xattr.h>
 > > +#include <linux/mount.h>
 > > +#include <linux/writeback.h>
 > > +#include <linux/blkdev.h>
 > > +#include "overlayfs.h"
 > > +
 > > +/**
 > > + * The upper_inodes list organizes the potential targets of syncfs:
 > > + * an ovl inode which has an upper inode is added to this list when
 > > + * it is initialized or updated and is deleted from this list when
 > > + * it is evicted.
 > > + *
 > > + * The ovl_inode flag "OVL_EVICT_PENDING" indicates that the ovl inode
 > > + * is in the eviction process; syncfs (actually ovl_sync_inodes()) waits
 > > + * on an evicting inode until its IO has completed in evict_inode().
 > > + *
 > > + * inode state/ovl_inode flags cycle in syncfs context will be as below.
 > > + * OVL_EVICT_PENDING (ovl_inode->flags) is only marked when inode having
 > > + * upper inode.
 > > + *
 > > + * (allocation)
 > > + *   |
 > > + * I_NEW (inode->i_state)
 > > + *   |
 > > + * NONE  (inode->i_state)
 > > + *   |
 > > + * OVL_EVICT_PENDING (ovl_inode->flags)
 > > + *   |
 > > + * I_FREEING (inode->i_state) | OVL_EVICT_PENDING (ovl_inode->flags)
 > > + *   |
 > > + * I_FREEING (inode->i_state) | I_CLEAR (inode->i_state)
 > > + *   |
 > > + * (destruction)
 > > + *
 > > + *
 > > + * There are five locks in syncfs contexts:
 > > + *
 > > + * upper_sb->s_umount(semaphore)    : avoiding r/o to r/w or vice versa
 > > + * sb->s_sync_lock(mutex)           : avoiding concurrent syncfs running
 > > + * ofs->upper_inodes_lock(spinlock) : protecting upper_inodes list
 > > + * sb->s_inode_wblist_lock(spinlock): protecting s_inodes_wb(sync waiting) list
 > > + * inode->i_lock(spinlock)          : protecting inode fields
 > > + *
 > > + * Lock dependencies in syncfs contexts:
 > > + *
 > > + * upper_sb->s_umount
 > > + *     sb->s_sync_lock
 > > + *             ofs->upper_inodes_lock
 > > + *                     inode->i_lock
 > > + *
 > > + * upper_sb->s_umount
 > > + *     sb->s_sync_lock
 > > + *             sb->s_inode_wblist_lock
 > > + *
 > > + */
 > > +
 > > +struct ovl_write_inode_work {
 > > +       struct work_struct work;
 > > +       struct inode *inode;
 > > +};
 > > +
 > > +static void ovl_write_inode_work_fn(struct work_struct *work)
 > > +{
 > > +       struct ovl_write_inode_work *ovl_wiw;
 > > +       struct inode *inode;
 > > +
 > > +       ovl_wiw = container_of(work, struct ovl_write_inode_work, work);
 > > +       inode = ovl_wiw->inode;
 > > +       write_inode_now(ovl_inode_upper(inode), 1);
 > > +
 > > +       spin_lock(&inode->i_lock);
 > > +       ovl_clear_flag(OVL_WRITE_INODE_PENDING, inode);
 > > +       wake_up_bit(&OVL_I(inode)->flags, OVL_WRITE_INODE_PENDING);
 > > +       spin_unlock(&inode->i_lock);
 > > +}
 > > +
 > > +void ovl_evict_inode(struct inode *inode)
 > > +{
 > > +       struct ovl_inode *oi = OVL_I(inode);
 > > +       struct ovl_write_inode_work ovl_wiw;
 > > +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
 > > +       wait_queue_head_t *wqh;
 > > +
 > > +       if (ovl_inode_upper(inode)) {
 > > +               if (current->flags & PF_MEMALLOC) {
 > > +                       spin_lock(&inode->i_lock);
 > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
 > > +                       wqh = bit_waitqueue(&oi->flags,
 > > +                                       OVL_WRITE_INODE_PENDING);
 > > +                       prepare_to_wait(wqh, &wait.wq_entry,
 > > +                                       TASK_UNINTERRUPTIBLE);
 > > +                       spin_unlock(&inode->i_lock);
 > > +
 > > +                       ovl_wiw.inode = inode;
 > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
 > > +                       schedule_work(&ovl_wiw.work);
 > > +
 > > +                       schedule();
 > > +                       finish_wait(wqh, &wait.wq_entry);
 > 
 > What is the reason to do this in another thread if this is a PF_MEMALLOC task?

Some underlying filesystems (for example ext4) check the flag in ->write_inode()
and treat it as an abnormal case (warn and return).

ext4_write_inode():
        if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
                sb_rdonly(inode->i_sb))
                        return 0;

Overlayfs inodes always stay clean even after writing/modifying the upper file,
so they are legitimate targets for kswapd; but from the lower layer's point of
view, ext4 just thinks kswapd has chosen the wrong dirty inode for memory reclaim.

 > 
 > > +               } else {
 > > +                       write_inode_now(ovl_inode_upper(inode), 1);
 > > +               }
 > > +               ovl_detach_upper_inodes_list(inode);
 > > +
 > > +               /*
 > > +                * ovl_sync_inodes() may wait until the
 > > +                * OVL_EVICT_PENDING flag is cleared.
 > > +                */
 > > +               spin_lock(&inode->i_lock);
 > > +               ovl_clear_flag(OVL_EVICT_PENDING, inode);
 > > +               wake_up_bit(&OVL_I(inode)->flags, OVL_EVICT_PENDING);
 > > +               spin_unlock(&inode->i_lock);
 > > +       }
 > > +       clear_inode(inode);
 > > +}
 > > +
 > > +void ovl_wait_evict_pending(struct inode *inode)
 > > +{
 > > +       struct ovl_fs *ofs = inode->i_sb->s_fs_info;
 > > +       struct ovl_inode *oi = OVL_I(inode);
 > > +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_EVICT_PENDING);
 > > +       wait_queue_head_t *wqh;
 > > +
 > > +       wqh = bit_waitqueue(&oi->flags, OVL_EVICT_PENDING);
 > > +       prepare_to_wait(wqh, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 > > +       spin_unlock(&inode->i_lock);
 > > +       spin_unlock(&ofs->upper_inodes_lock);
 > > +       schedule();
 > > +       finish_wait(wqh, &wait.wq_entry);
 > > +}
 > > +
 > > +/**
 > > + * ovl_sync_inodes
 > > + * @sb: The overlayfs super block
 > > + *
 > > + * The upper_inodes list organizes the ovl inodes which have an upper
 > > + * inode; we iterate it to look for and sync dirty upper inodes.
 > > + *
 > > + * When we start syncing an inode, we add it to a wait list explicitly;
 > > + * to save memory we reuse inode->i_wb_list and sb->s_inodes_wb, so
 > > + * those fields have slightly different meanings in overlayfs.
 > > + */
 > > +static void ovl_sync_inodes(struct super_block *sb)
 > > +{
 > > +       struct ovl_fs *ofs = sb->s_fs_info;
 > > +       struct ovl_inode *oi;
 > > +       struct inode *inode;
 > > +       struct inode *upper_inode;
 > > +       struct blk_plug plug;
 > > +       LIST_HEAD(sync_tmp_list);
 > > +
 > > +       struct writeback_control wbc_for_sync = {
 > > +               .sync_mode              = WB_SYNC_ALL,
 > > +               .for_sync               = 1,
 > > +               .range_start            = 0,
 > > +               .range_end              = LLONG_MAX,
 > > +               .nr_to_write            = LONG_MAX,
 > > +       };
 > > +
 > > +       blk_start_plug(&plug);
 > > +       spin_lock(&ofs->upper_inodes_lock);
 > > +       list_splice_init(&ofs->upper_inodes, &sync_tmp_list);
 > > +
 > > +       while (!list_empty(&sync_tmp_list)) {
 > > +               oi = list_first_entry(&sync_tmp_list, struct ovl_inode,
 > > +                                               upper_inodes_list);
 > > +               inode = &oi->vfs_inode;
 > > +               spin_lock(&inode->i_lock);
 > > +
 > > +               if (inode->i_state & I_NEW) {
 > > +                       list_move_tail(&oi->upper_inodes_list,
 > > +                                       &ofs->upper_inodes);
 > > +                       spin_unlock(&inode->i_lock);
 > > +                       continue;
 > > +               }
 > > +
 > > +               /*
 > > +                * If the OVL_EVICT_PENDING flag is set, the sync
 > > +                * operation is left to ovl_evict_inode(), so wait
 > > +                * here until the flag is cleared.
 > > +                */
 > > +               if (ovl_test_flag(OVL_EVICT_PENDING, inode)) {
 > > +                       ovl_wait_evict_pending(inode);
 > > +                       goto next;
 > 
 > Does this skip re-adding to upper_inodes_list?

No, the inode will be destroyed after the evict operation.


Thanks,
cgxu


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-16  6:08   ` Chengguang Xu
@ 2020-04-16  7:21     ` Miklos Szeredi
  2020-04-16  7:39       ` Miklos Szeredi
  2020-04-16 11:14     ` Jan Kara
  1 sibling, 1 reply; 14+ messages in thread
From: Miklos Szeredi @ 2020-04-16  7:21 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Amir Goldstein, Jan Kara, overlayfs

On Thu, Apr 16, 2020 at 8:09 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>
>  ---- 在 星期四, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> 撰写 ----
>  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:

>  > > +               if (current->flags & PF_MEMALLOC) {
>  > > +                       spin_lock(&inode->i_lock);
>  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
>  > > +                       wqh = bit_waitqueue(&oi->flags,
>  > > +                                       OVL_WRITE_INODE_PENDING);
>  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
>  > > +                                       TASK_UNINTERRUPTIBLE);
>  > > +                       spin_unlock(&inode->i_lock);
>  > > +
>  > > +                       ovl_wiw.inode = inode;
>  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
>  > > +                       schedule_work(&ovl_wiw.work);
>  > > +
>  > > +                       schedule();
>  > > +                       finish_wait(wqh, &wait.wq_entry);
>  >
>  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
>
> Some underlying filesystems(for example ext4) check the flag in ->write_inode()
> and treate it as an abnormal case.(warn and return)
>
> ext4_write_inode():
>         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
>                 sb_rdonly(inode->i_sb))
>                         return 0;
>
> overlayfs inodes are always keeping clean even after wring/modifying  upperfile ,
> so they are right target of kswapd  but in the point of lower layer, ext4 just thinks
> kswapd is choosing a wrong dirty inode to reclam memory.

I don't get it.  Why can't overlayfs just skip the writeback of upper
inode in the reclaim case?  It will be written back through the normal
reclaim channels.

Thanks,
Miklos


* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-16  7:21     ` Miklos Szeredi
@ 2020-04-16  7:39       ` Miklos Szeredi
  2020-04-16  8:00         ` Miklos Szeredi
  0 siblings, 1 reply; 14+ messages in thread
From: Miklos Szeredi @ 2020-04-16  7:39 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Amir Goldstein, Jan Kara, overlayfs

On Thu, Apr 16, 2020 at 9:21 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, Apr 16, 2020 at 8:09 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
> >
> >  ---- 在 星期四, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> 撰写 ----
> >  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>
> >  > > +               if (current->flags & PF_MEMALLOC) {
> >  > > +                       spin_lock(&inode->i_lock);
> >  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
> >  > > +                       wqh = bit_waitqueue(&oi->flags,
> >  > > +                                       OVL_WRITE_INODE_PENDING);
> >  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
> >  > > +                                       TASK_UNINTERRUPTIBLE);
> >  > > +                       spin_unlock(&inode->i_lock);
> >  > > +
> >  > > +                       ovl_wiw.inode = inode;
> >  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
> >  > > +                       schedule_work(&ovl_wiw.work);
> >  > > +
> >  > > +                       schedule();
> >  > > +                       finish_wait(wqh, &wait.wq_entry);
> >  >
> >  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
> >
> > Some underlying filesystems(for example ext4) check the flag in ->write_inode()
> > and treate it as an abnormal case.(warn and return)
> >
> > ext4_write_inode():
> >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
> >                 sb_rdonly(inode->i_sb))
> >                         return 0;
> >
> > overlayfs inodes are always keeping clean even after wring/modifying  upperfile ,
> > so they are right target of kswapd  but in the point of lower layer, ext4 just thinks
> > kswapd is choosing a wrong dirty inode to reclam memory.
>
> I don't get it.  Why can't overlayfs just skip the writeback of upper
> inode in the reclaim case?  It will be written back through the normal
> relcaim channels.

And how do we get reclaim on overlay inode at all?   Overlay inodes
will get evicted immediately after their refcount drops to zero, so
there's absolutely no chance that memory reclaim will encounter them,
no?

Thanks,
Miklos


* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-16  7:39       ` Miklos Szeredi
@ 2020-04-16  8:00         ` Miklos Szeredi
  2020-04-16 11:16           ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Miklos Szeredi @ 2020-04-16  8:00 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Amir Goldstein, Jan Kara, overlayfs

On Thu, Apr 16, 2020 at 9:39 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, Apr 16, 2020 at 9:21 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Thu, Apr 16, 2020 at 8:09 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
> > >
> > >  ---- 在 星期四, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> 撰写 ----
> > >  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
> >
> > >  > > +               if (current->flags & PF_MEMALLOC) {
> > >  > > +                       spin_lock(&inode->i_lock);
> > >  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
> > >  > > +                       wqh = bit_waitqueue(&oi->flags,
> > >  > > +                                       OVL_WRITE_INODE_PENDING);
> > >  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
> > >  > > +                                       TASK_UNINTERRUPTIBLE);
> > >  > > +                       spin_unlock(&inode->i_lock);
> > >  > > +
> > >  > > +                       ovl_wiw.inode = inode;
> > >  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
> > >  > > +                       schedule_work(&ovl_wiw.work);
> > >  > > +
> > >  > > +                       schedule();
> > >  > > +                       finish_wait(wqh, &wait.wq_entry);
> > >  >
> > >  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
> > >
> > > Some underlying filesystems(for example ext4) check the flag in ->write_inode()
> > > and treate it as an abnormal case.(warn and return)
> > >
> > > ext4_write_inode():
> > >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
> > >                 sb_rdonly(inode->i_sb))
> > >                         return 0;
> > >
> > > overlayfs inodes are always keeping clean even after wring/modifying  upperfile ,
> > > so they are right target of kswapd  but in the point of lower layer, ext4 just thinks
> > > kswapd is choosing a wrong dirty inode to reclam memory.
> >
> > I don't get it.  Why can't overlayfs just skip the writeback of upper
> > inode in the reclaim case?  It will be written back through the normal
> > relcaim channels.
>
> And how do we get reclaim on overlay inode at all?   Overlay inodes
> will get evicted immediately after their refcount drops to zero, so
> there's absolutely no chance that memory reclaim will encounter them,
> no?

Spoke too soon.   Obviously this case is about dentry reclaim, not
inode reclaim.

So how about temporarily clearing PF_MEMALLOC in this case?  Doing
this in a kernel thread doesn't seem to add any advantage.

The real solution, as has already been stated, would be to have
changes to upper go through the overlay mapping.

Thanks,
Miklos


* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-16  6:08   ` Chengguang Xu
  2020-04-16  7:21     ` Miklos Szeredi
@ 2020-04-16 11:14     ` Jan Kara
  2020-04-16 13:52       ` Chengguang Xu
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2020-04-16 11:14 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Miklos Szeredi, Amir Goldstein, Jan Kara, overlayfs

On Thu 16-04-20 14:08:59, Chengguang Xu wrote:
>  ---- 在 星期四, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> 撰写 ----
>  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>  > > +void ovl_evict_inode(struct inode *inode)
>  > > +{
>  > > +       struct ovl_inode *oi = OVL_I(inode);
>  > > +       struct ovl_write_inode_work ovl_wiw;
>  > > +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
>  > > +       wait_queue_head_t *wqh;
>  > > +
>  > > +       if (ovl_inode_upper(inode)) {
>  > > +               if (current->flags & PF_MEMALLOC) {
>  > > +                       spin_lock(&inode->i_lock);
>  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
>  > > +                       wqh = bit_waitqueue(&oi->flags,
>  > > +                                       OVL_WRITE_INODE_PENDING);
>  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
>  > > +                                       TASK_UNINTERRUPTIBLE);
>  > > +                       spin_unlock(&inode->i_lock);
>  > > +
>  > > +                       ovl_wiw.inode = inode;
>  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
>  > > +                       schedule_work(&ovl_wiw.work);
>  > > +
>  > > +                       schedule();
>  > > +                       finish_wait(wqh, &wait.wq_entry);
>  > 
>  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
> 
> Some underlying filesystems(for example ext4) check the flag in
> ->write_inode() and treate it as an abnormal case.(warn and return)
> 
> ext4_write_inode():
>         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
>                 sb_rdonly(inode->i_sb))
>                         return 0;
> 
> overlayfs inodes are always keeping clean even after wring/modifying
> upperfile , so they are right target of kswapd  but in the point of lower
> layer, ext4 just thinks kswapd is choosing a wrong dirty inode to reclam
> memory.

In ext4, it isn't a big problem if ext4_write_inode() is called from
kswapd. But if ext4_write_inode() is called from direct reclaim (which also
sets PF_MEMALLOC) we can deadlock because we may wait for transaction
commit and transaction commit may require locks (such as page lock or
waiting for page writeback to complete) which are held by the task
currently in direct reclaim. Your push to workqueue will silence the
warning but does not solve the possible deadlock.

I'm actually not sure why you need to writeback the upper inode when
reclaiming overlayfs inode. Why not leave it on flush worker on upper fs?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-16  8:00         ` Miklos Szeredi
@ 2020-04-16 11:16           ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2020-04-16 11:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Chengguang Xu, Amir Goldstein, Jan Kara, overlayfs

On Thu 16-04-20 10:00:13, Miklos Szeredi wrote:
> On Thu, Apr 16, 2020 at 9:39 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Thu, Apr 16, 2020 at 9:21 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Thu, Apr 16, 2020 at 8:09 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
> > > >
> > > >  ---- 在 星期四, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> 撰写 ----
> > > >  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
> > >
> > > >  > > +               if (current->flags & PF_MEMALLOC) {
> > > >  > > +                       spin_lock(&inode->i_lock);
> > > >  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
> > > >  > > +                       wqh = bit_waitqueue(&oi->flags,
> > > >  > > +                                       OVL_WRITE_INODE_PENDING);
> > > >  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
> > > >  > > +                                       TASK_UNINTERRUPTIBLE);
> > > >  > > +                       spin_unlock(&inode->i_lock);
> > > >  > > +
> > > >  > > +                       ovl_wiw.inode = inode;
> > > >  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
> > > >  > > +                       schedule_work(&ovl_wiw.work);
> > > >  > > +
> > > >  > > +                       schedule();
> > > >  > > +                       finish_wait(wqh, &wait.wq_entry);
> > > >  >
> > > >  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
> > > >
> > > > Some underlying filesystems(for example ext4) check the flag in ->write_inode()
> > > > and treate it as an abnormal case.(warn and return)
> > > >
> > > > ext4_write_inode():
> > > >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
> > > >                 sb_rdonly(inode->i_sb))
> > > >                         return 0;
> > > >
> > > > overlayfs inodes are always keeping clean even after wring/modifying  upperfile ,
> > > > so they are right target of kswapd  but in the point of lower layer, ext4 just thinks
> > > > kswapd is choosing a wrong dirty inode to reclam memory.
> > >
> > > I don't get it.  Why can't overlayfs just skip the writeback of upper
> > > inode in the reclaim case?  It will be written back through the normal
> > > relcaim channels.
> >
> > And how do we get reclaim on overlay inode at all?   Overlay inodes
> > will get evicted immediately after their refcount drops to zero, so
> > there's absolutely no chance that memory reclaim will encounter them,
> > no?
> 
> Spoke too soon.   Obviously this case is about dentry reclaim, not
> inode reclaim.
> 
> So how about temporarily clearing PF_MEMALLOC in this case?  Doing
> this is a kernel thread doesn't seem to add any advantages.

Clearing PF_MEMALLOC will not solve the deadlock I've described in the
reply to Chengguang. Ext4 really cannot safely handle data integrity
writeback (which is what write_inode_now(inode, 1) does) from direct
reclaim.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-16 11:14     ` Jan Kara
@ 2020-04-16 13:52       ` Chengguang Xu
  2020-04-16 14:33         ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Chengguang Xu @ 2020-04-16 13:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Amir Goldstein, overlayfs

 ---- 在 星期四, 2020-04-16 19:14:24 Jan Kara <jack@suse.cz> 撰写 ----
 > On Thu 16-04-20 14:08:59, Chengguang Xu wrote:
 > >  ---- 在 星期四, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> 撰写 ----
 > >  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
 > >  > > +void ovl_evict_inode(struct inode *inode)
 > >  > > +{
 > >  > > +       struct ovl_inode *oi = OVL_I(inode);
 > >  > > +       struct ovl_write_inode_work ovl_wiw;
 > >  > > +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
 > >  > > +       wait_queue_head_t *wqh;
 > >  > > +
 > >  > > +       if (ovl_inode_upper(inode)) {
 > >  > > +               if (current->flags & PF_MEMALLOC) {
 > >  > > +                       spin_lock(&inode->i_lock);
 > >  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
 > >  > > +                       wqh = bit_waitqueue(&oi->flags,
 > >  > > +                                       OVL_WRITE_INODE_PENDING);
 > >  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
 > >  > > +                                       TASK_UNINTERRUPTIBLE);
 > >  > > +                       spin_unlock(&inode->i_lock);
 > >  > > +
 > >  > > +                       ovl_wiw.inode = inode;
 > >  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
 > >  > > +                       schedule_work(&ovl_wiw.work);
 > >  > > +
 > >  > > +                       schedule();
 > >  > > +                       finish_wait(wqh, &wait.wq_entry);
 > >  > 
 > >  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
 > > 
 > > Some underlying filesystems(for example ext4) check the flag in
 > > ->write_inode() and treate it as an abnormal case.(warn and return)
 > > 
 > > ext4_write_inode():
 > >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
 > >                 sb_rdonly(inode->i_sb))
 > >                         return 0;
 > > 
 > > overlayfs inodes always stay clean even after writing/modifying the
 > > upper file, so they are a valid target for kswapd; but from the point
 > > of view of the lower layer, ext4 just thinks kswapd is choosing a wrong
 > > dirty inode to reclaim memory.
 > 
 > In ext4, it isn't a big problem if ext4_write_inode() is called from
 > kswapd. But if ext4_write_inode() is called from direct reclaim (which also
 > sets PF_MEMALLOC) we can deadlock because we may wait for transaction
 > commit and transaction commit may require locks (such as page lock or
 > waiting for page writeback to complete) which are held by the task
 > currently in direct reclaim. Your push to workqueue will silence the
 > warning but does not solve the possible deadlock.
 > 
 > I'm actually not sure why you need to writeback the upper inode when
 > reclaiming overlayfs inode. Why not leave it on flush worker on upper fs?
 > 

Because it is the last chance we have to sync the dirty upper inode: after
evicting the overlayfs inode we can no longer find the associated dirty upper
inode on any list, and that dirty upper inode will be skipped by syncfs().
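
(Editor's sketch of the situation described above; the list and field names
below, e.g. `upper_inodes_list`, are assumptions modeled on this patch series
rather than the actual code.)

```c
/* Sketch only: overlayfs syncfs walks the upper inodes that are still
 * linked into a per-superblock list maintained by the patch series. */
static void ovl_sync_inodes_sketch(struct super_block *sb)
{
	struct ovl_fs *ofs = sb->s_fs_info;
	struct ovl_inode *oi;

	spin_lock(&ofs->upper_inodes_lock);
	list_for_each_entry(oi, &ofs->upper_inodes_list, upper_inodes_node) {
		/* ... queue writeback for ovl_inode_upper(&oi->vfs_inode) ... */
	}
	spin_unlock(&ofs->upper_inodes_lock);

	/*
	 * ovl_evict_inode() unlinks oi from upper_inodes_list, so a dirty
	 * upper inode whose overlay inode was already evicted is invisible
	 * to this walk; that is why eviction is the last chance to sync it.
	 */
}
```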

Thanks,
cgxu

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-16 13:52       ` Chengguang Xu
@ 2020-04-16 14:33         ` Jan Kara
  2020-04-17  1:23           ` Chengguang Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2020-04-16 14:33 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Jan Kara, Miklos Szeredi, Amir Goldstein, overlayfs

On Thu 16-04-20 21:52:27, Chengguang Xu wrote:
>  ---- On Thursday, 2020-04-16 19:14:24 Jan Kara <jack@suse.cz> wrote ----
>  > On Thu 16-04-20 14:08:59, Chengguang Xu wrote:
>  > >  ---- On Thursday, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> wrote ----
>  > >  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>  > >  > > +void ovl_evict_inode(struct inode *inode)
>  > >  > > +{
>  > >  > > +       struct ovl_inode *oi = OVL_I(inode);
>  > >  > > +       struct ovl_write_inode_work ovl_wiw;
>  > >  > > +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
>  > >  > > +       wait_queue_head_t *wqh;
>  > >  > > +
>  > >  > > +       if (ovl_inode_upper(inode)) {
>  > >  > > +               if (current->flags & PF_MEMALLOC) {
>  > >  > > +                       spin_lock(&inode->i_lock);
>  > >  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
>  > >  > > +                       wqh = bit_waitqueue(&oi->flags,
>  > >  > > +                                       OVL_WRITE_INODE_PENDING);
>  > >  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
>  > >  > > +                                       TASK_UNINTERRUPTIBLE);
>  > >  > > +                       spin_unlock(&inode->i_lock);
>  > >  > > +
>  > >  > > +                       ovl_wiw.inode = inode;
>  > >  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
>  > >  > > +                       schedule_work(&ovl_wiw.work);
>  > >  > > +
>  > >  > > +                       schedule();
>  > >  > > +                       finish_wait(wqh, &wait.wq_entry);
>  > >  > 
>  > >  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
>  > > 
>  > > Some underlying filesystems (for example ext4) check the flag in
>  > > ->write_inode() and treat it as an abnormal case (warn and return).
>  > > 
>  > > ext4_write_inode():
>  > >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
>  > >                 sb_rdonly(inode->i_sb))
>  > >                         return 0;
>  > > 
>  > > overlayfs inodes always stay clean even after writing/modifying the
>  > > upper file, so they are a valid target for kswapd; but from the point
>  > > of view of the lower layer, ext4 just thinks kswapd is choosing a wrong
>  > > dirty inode to reclaim memory.
>  > 
>  > In ext4, it isn't a big problem if ext4_write_inode() is called from
>  > kswapd. But if ext4_write_inode() is called from direct reclaim (which also
>  > sets PF_MEMALLOC) we can deadlock because we may wait for transaction
>  > commit and transaction commit may require locks (such as page lock or
>  > waiting for page writeback to complete) which are held by the task
>  > currently in direct reclaim. Your push to workqueue will silence the
>  > warning but does not solve the possible deadlock.
>  > 
>  > I'm actually not sure why you need to writeback the upper inode when
>  > reclaiming overlayfs inode. Why not leave it on flush worker on upper fs?
>  > 
> 
> Because it is the last chance we have to sync the dirty upper inode: after
> evicting the overlayfs inode we can no longer find the associated dirty upper
> inode on any list, and that dirty upper inode will be skipped by syncfs().

I see. But this flushing of dirty inodes on reclaim really isn't a great
idea. It can also stall reclaim (due to it being stuck waiting for IO) and
thus lead to bad behavior in low memory situations. It's better to just
skip reclaiming such inodes - but then I agree it's a difficult question
when to reclaim them. Ideally you'd need to hook into inode_lru_isolate()
but that's just too ugly (but maybe it could be done in some clean manner).
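
(Editor's sketch, purely illustrative of the direction suggested here: there is
no overlayfs hook in inode_lru_isolate() today, so the callback below and its
integration with the LRU walk are assumptions, not a real interface.)

```c
/* Illustrative only: a predicate the inode LRU walk could consult so
 * that overlay inodes backed by a dirty upper inode are skipped rather
 * than flushed from ->evict_inode(). */
static enum lru_status ovl_lru_isolate_hint(struct inode *inode)
{
	struct inode *upper = ovl_inode_upper(inode);

	/*
	 * Upper inode still dirty: reclaiming now would either force
	 * writeback from reclaim context or lose the last reference
	 * syncfs() can find, so keep the overlay inode cached.
	 */
	if (upper && (upper->i_state & I_DIRTY_ALL))
		return LRU_SKIP;

	return LRU_REMOVED;	/* clean: safe to reclaim as usual */
}
```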

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-16 14:33         ` Jan Kara
@ 2020-04-17  1:23           ` Chengguang Xu
  2020-04-17  9:26             ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Chengguang Xu @ 2020-04-17  1:23 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Amir Goldstein, overlayfs

 ---- On Thursday, 2020-04-16 22:33:49 Jan Kara <jack@suse.cz> wrote ----
 > On Thu 16-04-20 21:52:27, Chengguang Xu wrote:
 > >  ---- On Thursday, 2020-04-16 19:14:24 Jan Kara <jack@suse.cz> wrote ----
 > >  > On Thu 16-04-20 14:08:59, Chengguang Xu wrote:
 > >  > >  ---- On Thursday, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> wrote ----
 > >  > >  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
 > >  > >  > > +void ovl_evict_inode(struct inode *inode)
 > >  > >  > > +{
 > >  > >  > > +       struct ovl_inode *oi = OVL_I(inode);
 > >  > >  > > +       struct ovl_write_inode_work ovl_wiw;
 > >  > >  > > +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
 > >  > >  > > +       wait_queue_head_t *wqh;
 > >  > >  > > +
 > >  > >  > > +       if (ovl_inode_upper(inode)) {
 > >  > >  > > +               if (current->flags & PF_MEMALLOC) {
 > >  > >  > > +                       spin_lock(&inode->i_lock);
 > >  > >  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
 > >  > >  > > +                       wqh = bit_waitqueue(&oi->flags,
 > >  > >  > > +                                       OVL_WRITE_INODE_PENDING);
 > >  > >  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
 > >  > >  > > +                                       TASK_UNINTERRUPTIBLE);
 > >  > >  > > +                       spin_unlock(&inode->i_lock);
 > >  > >  > > +
 > >  > >  > > +                       ovl_wiw.inode = inode;
 > >  > >  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
 > >  > >  > > +                       schedule_work(&ovl_wiw.work);
 > >  > >  > > +
 > >  > >  > > +                       schedule();
 > >  > >  > > +                       finish_wait(wqh, &wait.wq_entry);
 > >  > >  > 
 > >  > >  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
 > >  > > 
 > >  > > Some underlying filesystems (for example ext4) check the flag in
 > >  > > ->write_inode() and treat it as an abnormal case (warn and return).
 > >  > > 
 > >  > > ext4_write_inode():
 > >  > >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
 > >  > >                 sb_rdonly(inode->i_sb))
 > >  > >                         return 0;
 > >  > > 
 > >  > > overlayfs inodes always stay clean even after writing/modifying the
 > >  > > upper file, so they are a valid target for kswapd; but from the point
 > >  > > of view of the lower layer, ext4 just thinks kswapd is choosing a wrong
 > >  > > dirty inode to reclaim memory.
 > >  > 
 > >  > In ext4, it isn't a big problem if ext4_write_inode() is called from
 > >  > kswapd. But if ext4_write_inode() is called from direct reclaim (which also
 > >  > sets PF_MEMALLOC) we can deadlock because we may wait for transaction
 > >  > commit and transaction commit may require locks (such as page lock or
 > >  > waiting for page writeback to complete) which are held by the task
 > >  > currently in direct reclaim. Your push to workqueue will silence the
 > >  > warning but does not solve the possible deadlock.
 > >  > 
 > >  > I'm actually not sure why you need to writeback the upper inode when
 > >  > reclaiming overlayfs inode. Why not leave it on flush worker on upper fs?
 > >  > 
 > > 
 > > Because it is the last chance we have to sync the dirty upper inode: after
 > > evicting the overlayfs inode we can no longer find the associated dirty
 > > upper inode on any list, and that dirty upper inode will be skipped by
 > > syncfs().
 > 
 > I see. But this flushing of dirty inodes on reclaim really isn't a great
 > idea. It can also stall reclaim (due to it being stuck waiting for IO) and
 > thus lead to bad behavior in low memory situations. It's better to just
 > skip reclaiming such inodes - but then I agree it's a difficult question
 > when to reclaim them. Ideally you'd need to hook into inode_lru_isolate()
 > but that's just too ugly (but maybe it could be done in some clean manner).
 > 

How about preparing another list to organize this kind of inode? When evicting
an overlay inode we grab a reference to the dirty upper inode, put an entry (a
field pointing to the upper inode) into this new list, and return; after that
we periodically check the list entries to release (iput) clean upper inodes.

In overlayfs's syncfs(), we also need to iterate this new list after iterating
oi->upper_inodes_list.
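
(Editor's sketch of this proposal; every identifier below, e.g.
`ovl_orphan_upper` and `ofs->orphan_list`, is invented for illustration and is
not part of the posted patch.)

```c
/* Entries pin upper inodes that outlive their evicted overlay inode. */
struct ovl_orphan_upper {
	struct list_head node;
	struct inode *upper;		/* pinned with ihold() */
};

/* Called from ovl_evict_inode() instead of writing the inode back. */
static void ovl_stash_dirty_upper(struct ovl_fs *ofs, struct inode *upper)
{
	struct ovl_orphan_upper *orphan;

	orphan = kmalloc(sizeof(*orphan), GFP_NOFS);
	if (!orphan)
		return;			/* best effort: syncfs may miss it */
	ihold(upper);			/* keep the upper inode reachable */
	orphan->upper = upper;
	spin_lock(&ofs->orphan_lock);
	list_add_tail(&orphan->node, &ofs->orphan_list);
	spin_unlock(&ofs->orphan_lock);
}

/* Periodic (or syncfs-time) reaper: release entries that became clean.
 * iput() can sleep, so move candidates off the list before dropping them. */
static void ovl_reap_clean_orphans(struct ovl_fs *ofs)
{
	struct ovl_orphan_upper *orphan, *next;
	LIST_HEAD(dispose);

	spin_lock(&ofs->orphan_lock);
	list_for_each_entry_safe(orphan, next, &ofs->orphan_list, node) {
		if (!(orphan->upper->i_state & I_DIRTY_ALL))
			list_move(&orphan->node, &dispose);
	}
	spin_unlock(&ofs->orphan_lock);

	list_for_each_entry_safe(orphan, next, &dispose, node) {
		iput(orphan->upper);
		kfree(orphan);
	}
}
```

Overlayfs's syncfs() would then walk `ofs->orphan_list` after the regular
upper-inode list, so dirty upper inodes of already-evicted overlay inodes are
no longer skipped.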

Thanks,
cgxu


* Re: [PATCH v11] ovl: Improving syncfs efficiency
  2020-04-17  1:23           ` Chengguang Xu
@ 2020-04-17  9:26             ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2020-04-17  9:26 UTC (permalink / raw)
  To: Chengguang Xu; +Cc: Jan Kara, Miklos Szeredi, Amir Goldstein, overlayfs

On Fri 17-04-20 09:23:07, Chengguang Xu wrote:
>  ---- On Thursday, 2020-04-16 22:33:49 Jan Kara <jack@suse.cz> wrote ----
>  > On Thu 16-04-20 21:52:27, Chengguang Xu wrote:
>  > >  ---- On Thursday, 2020-04-16 19:14:24 Jan Kara <jack@suse.cz> wrote ----
>  > >  > On Thu 16-04-20 14:08:59, Chengguang Xu wrote:
>  > >  > >  ---- On Thursday, 2020-04-16 03:19:50 Miklos Szeredi <miklos@szeredi.hu> wrote ----
>  > >  > >  > On Mon, Feb 10, 2020 at 4:10 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>  > >  > >  > > +void ovl_evict_inode(struct inode *inode)
>  > >  > >  > > +{
>  > >  > >  > > +       struct ovl_inode *oi = OVL_I(inode);
>  > >  > >  > > +       struct ovl_write_inode_work ovl_wiw;
>  > >  > >  > > +       DEFINE_WAIT_BIT(wait, &oi->flags, OVL_WRITE_INODE_PENDING);
>  > >  > >  > > +       wait_queue_head_t *wqh;
>  > >  > >  > > +
>  > >  > >  > > +       if (ovl_inode_upper(inode)) {
>  > >  > >  > > +               if (current->flags & PF_MEMALLOC) {
>  > >  > >  > > +                       spin_lock(&inode->i_lock);
>  > >  > >  > > +                       ovl_set_flag(OVL_WRITE_INODE_PENDING, inode);
>  > >  > >  > > +                       wqh = bit_waitqueue(&oi->flags,
>  > >  > >  > > +                                       OVL_WRITE_INODE_PENDING);
>  > >  > >  > > +                       prepare_to_wait(wqh, &wait.wq_entry,
>  > >  > >  > > +                                       TASK_UNINTERRUPTIBLE);
>  > >  > >  > > +                       spin_unlock(&inode->i_lock);
>  > >  > >  > > +
>  > >  > >  > > +                       ovl_wiw.inode = inode;
>  > >  > >  > > +                       INIT_WORK(&ovl_wiw.work, ovl_write_inode_work_fn);
>  > >  > >  > > +                       schedule_work(&ovl_wiw.work);
>  > >  > >  > > +
>  > >  > >  > > +                       schedule();
>  > >  > >  > > +                       finish_wait(wqh, &wait.wq_entry);
>  > >  > >  > 
>  > >  > >  > What is the reason to do this in another thread if this is a PF_MEMALLOC task?
>  > >  > > 
>  > >  > > Some underlying filesystems (for example ext4) check the flag in
>  > >  > > ->write_inode() and treat it as an abnormal case (warn and return).
>  > >  > > 
>  > >  > > ext4_write_inode():
>  > >  > >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
>  > >  > >                 sb_rdonly(inode->i_sb))
>  > >  > >                         return 0;
>  > >  > > 
>  > >  > > overlayfs inodes always stay clean even after writing/modifying the
>  > >  > > upper file, so they are a valid target for kswapd; but from the point
>  > >  > > of view of the lower layer, ext4 just thinks kswapd is choosing a
>  > >  > > wrong dirty inode to reclaim memory.
>  > >  > 
>  > >  > In ext4, it isn't a big problem if ext4_write_inode() is called from
>  > >  > kswapd. But if ext4_write_inode() is called from direct reclaim (which also
>  > >  > sets PF_MEMALLOC) we can deadlock because we may wait for transaction
>  > >  > commit and transaction commit may require locks (such as page lock or
>  > >  > waiting for page writeback to complete) which are held by the task
>  > >  > currently in direct reclaim. Your push to workqueue will silence the
>  > >  > warning but does not solve the possible deadlock.
>  > >  > 
>  > >  > I'm actually not sure why you need to writeback the upper inode when
>  > >  > reclaiming overlayfs inode. Why not leave it on flush worker on upper fs?
>  > >  > 
>  > > 
>  > > Because it is the last chance we have to sync the dirty upper inode:
>  > > after evicting the overlayfs inode we can no longer find the associated
>  > > dirty upper inode on any list, and that dirty upper inode will be
>  > > skipped by syncfs().
>  > 
>  > I see. But this flushing of dirty inodes on reclaim really isn't a great
>  > idea. It can also stall reclaim (due to it being stuck waiting for IO) and
>  > thus lead to bad behavior in low memory situations. It's better to just
>  > skip reclaiming such inodes - but then I agree it's a difficult question
>  > when to reclaim them. Ideally you'd need to hook into inode_lru_isolate()
>  > but that's just too ugly (but maybe it could be done in some clean manner).
>  > 
> 
> How about preparing another list to organize this kind of inode? When
> evicting an overlay inode we grab a reference to the dirty upper inode, put
> an entry (a field pointing to the upper inode) into this new list, and
> return; after that we periodically check the list entries to release (iput)
> clean upper inodes.
> 
> In overlayfs's syncfs(), we also need to iterate this new list after
> iterating oi->upper_inodes_list.
 
Sure, that would work.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2020-04-17  9:26 UTC | newest]

Thread overview: 14+ messages
2020-02-10  3:10 [PATCH v11] ovl: Improving syncfs efficiency Chengguang Xu
2020-03-17  4:41 ` Re: [PATCH " Chengguang Xu
2020-03-17  4:49   ` Chengguang Xu
2020-04-15 19:19 ` [PATCH " Miklos Szeredi
2020-04-16  6:08   ` Chengguang Xu
2020-04-16  7:21     ` Miklos Szeredi
2020-04-16  7:39       ` Miklos Szeredi
2020-04-16  8:00         ` Miklos Szeredi
2020-04-16 11:16           ` Jan Kara
2020-04-16 11:14     ` Jan Kara
2020-04-16 13:52       ` Chengguang Xu
2020-04-16 14:33         ` Jan Kara
2020-04-17  1:23           ` Chengguang Xu
2020-04-17  9:26             ` Jan Kara
