* [RFC PATCH v5 0/2] improve sync efficiency with sb inode wb list
From: Brian Foster @ 2016-01-12 18:35 UTC
  To: linux-fsdevel; +Cc: dchinner, jbacik, jack

Hi all,

This is a continuation of the sync efficiency fix originally from Dave
Chinner and resurrected by Josef Bacik. The last version posted (by
Josef) is referenced below. My understanding is that it wasn't merged
along with the rest of that patchset because cgroup aware writeback
(cgaw) had just been merged and the resulting conflicts prevented this
patch from going in.

After I addressed those conflicts, testing produced a lockdep splat
showing that the locking for the new list was not quite correct.
Further discussion with Jan K. called out possible races between
partitions of the same backing device. Jan suggested using a
per-superblock list rather than a per-bdi list to avoid the race and to
minimize unnecessary complexity from the cgaw code.

In summary, this version uses the per-sb list rather than the per-bdi
list. The locking is updated to use a new irq-safe lock for the wb
list, to piggyback on the existing s_inode_list_lock to gain inode
references, and to otherwise explicitly handle any races between sync
and writeback completion so the list stays coherent with the mapping
state. The lazy list removal has been dropped: IIUC it was originally
added to mitigate locking complexity, but that didn't exactly work out,
as noted above. I'm still not 100% sure the locking is sane, but it
passes my spot testing so far without any explosions or lockdep
splats...
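
For quick reference, the lock ordering in patch 1 boils down to roughly
the following (a condensed sketch of wait_sb_inodes(), not the literal
patch code; per-inode processing is elided):

	/*
	 * Condensed locking sketch for the sync-side walk in patch 1/2.
	 */
	static void wait_sb_inodes(struct super_block *sb)
	{
		LIST_HEAD(sync_list);

		mutex_lock(&sb->s_sync_lock);	   /* serialise concurrent syncs */
		spin_lock(&sb->s_inode_list_lock); /* pins inodes for __iget() */
		/*
		 * irq-safe: also acquired under mapping->tree_lock by
		 * __test_set_page_writeback() and test_clear_page_writeback().
		 */
		spin_lock_irq(&sb->s_inode_wblist_lock);
		list_splice_init(&sb->s_inodes_wb, &sync_list);

		/* ... walk sync_list, dropping/retaking the locks per inode ... */

		spin_unlock_irq(&sb->s_inode_wblist_lock);
		spin_unlock(&sb->s_inode_list_lock);
		mutex_unlock(&sb->s_sync_lock);
	}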

Thoughts, reviews, flames appreciated.

Brian

v5:
- Converted from per-bdi list to per-sb list. Also marked as RFC and
  dropped testing/review tags.
- Updated to use new irq-safe lock for wb list.
- Dropped lazy list removal. Inodes are removed when the mapping is
  cleared of the writeback tag.
- Tweaked wait_sb_inodes() to remove deferred iput(), other cleanups.
- Added wb list tracepoint patch.
v4: http://marc.info/?l=linux-fsdevel&m=143511628828000&w=2

Brian Foster (1):
  wb: inode writeback list tracking tracepoints

Dave Chinner (1):
  sb: add a new writeback list for sync

 fs/block_dev.c                   |   1 +
 fs/fs-writeback.c                | 148 ++++++++++++++++++++++++++++++++-------
 fs/inode.c                       |   1 +
 fs/super.c                       |   2 +
 include/linux/fs.h               |   4 ++
 include/linux/writeback.h        |   3 +
 include/trace/events/writeback.h |  22 ++++--
 mm/page-writeback.c              |  19 +++++
 8 files changed, 171 insertions(+), 29 deletions(-)

-- 
2.4.3


* [RFC PATCH v5 1/2] sb: add a new writeback list for sync
From: Brian Foster @ 2016-01-12 18:35 UTC
  To: linux-fsdevel; +Cc: dchinner, jbacik, jack

From: Dave Chinner <dchinner@redhat.com>

wait_sb_inodes() currently does a walk of all inodes in the
filesystem to find dirty ones to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the sb when the mapping is
first tagged as having pages under writeback. wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback
list and call them when the overall mapping is tagged for or cleared
from writeback. Update wait_sb_inodes() to walk only the inodes
under writeback due to the sync.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/block_dev.c            |   1 +
 fs/fs-writeback.c         | 139 +++++++++++++++++++++++++++++++++++++---------
 fs/inode.c                |   1 +
 fs/super.c                |   2 +
 include/linux/fs.h        |   4 ++
 include/linux/writeback.h |   3 +
 mm/page-writeback.c       |  19 +++++++
 7 files changed, 144 insertions(+), 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 44d4a1e..d9552bc 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -75,6 +75,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
+	sb_clear_inode_writeback(bdev->bd_inode);
 	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 023f6a1..6821347 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -945,6 +945,37 @@ void inode_io_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the sb
+ */
+void sb_mark_inode_writeback(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned long flags;
+
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
+		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the sb
+ */
+void sb_clear_inode_writeback(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned long flags;
+
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -1118,13 +1149,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
+	 * the inode and then deref bdev->bd_disk, which at this point has been
+	 * set to NULL, so we would panic.  At the point we are dropping our
+	 * bd_inode we won't have any pages under writeback on the device so
+	 * this is safe.  But just in case we'll assert to make sure we don't
+	 * screw this up.
+	 */
+	if (!sb_is_blkdev_sb(inode->i_sb))
+		sb_clear_inode_writeback(inode);
+	BUG_ON(!list_empty(&inode->i_wb_list));
 }
 
 /*
@@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	LIST_HEAD(sync_list);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	/*
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out.  In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * s_inode_wblist_lock protects the wb list and is irq-safe as it is
+	 * acquired inside of the mapping lock by __test_set_page_writeback().
+	 * We cannot acquire i_lock while the wblist lock is held without
+	 * introducing irq inversion issues. Since s_inodes_wb is a subset of
+	 * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
+	 * until we have a reference. Note that s_inode_wblist_lock protects the
+	 * local sync_list as well because inodes can be dropped from either
+	 * list by writeback completion.
+	 */
 	mutex_lock(&sb->s_sync_lock);
+
 	spin_lock(&sb->s_inode_list_lock);
+	spin_lock_irq(&sb->s_inode_wblist_lock);
+	list_splice_init(&sb->s_inodes_wb, &sync_list);
 
-	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
-	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		list_del_init(&inode->i_wb_list);
+
+		/*
+		 * The mapping can be cleaned before we wait on the inode since
+		 * we do not have the mapping lock.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+			continue;
+
+		spin_unlock_irq(&sb->s_inode_wblist_lock);
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
 			spin_unlock(&inode->i_lock);
+
+			spin_lock_irq(&sb->s_inode_wblist_lock);
 			continue;
 		}
 		__iget(inode);
@@ -2140,29 +2219,39 @@ static void wait_sb_inodes(struct super_block *sb)
 		spin_unlock(&sb->s_inode_list_lock);
 
 		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
-
-		/*
 		 * We keep the error status of individual mapping so that
 		 * applications can catch the writeback error using fsync(2).
 		 * See filemap_fdatawait_keep_errors() for details.
 		 */
 		filemap_fdatawait_keep_errors(mapping);
-
 		cond_resched();
 
+		/*
+		 * The inode has been written back. Check whether we still have
+		 * pages under I/O and move the inode back to the primary list
+		 * if necessary. sb_mark_inode_writeback() might not have done
+		 * anything if the writeback tag hadn't been cleared from the
+		 * mapping by the time more wb had started.
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			spin_lock(&sb->s_inode_wblist_lock);
+			if (list_empty(&inode->i_wb_list))
+				list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
+			spin_unlock(&sb->s_inode_wblist_lock);
+		} else
+			WARN_ON(!list_empty(&inode->i_wb_list));
+		spin_unlock_irq(&mapping->tree_lock);
+
+		iput(inode);
+
 		spin_lock(&sb->s_inode_list_lock);
+		spin_lock_irq(&sb->s_inode_wblist_lock);
 	}
+
+	spin_unlock_irq(&sb->s_inode_wblist_lock);
 	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+
 	mutex_unlock(&sb->s_sync_lock);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 5bb85a0..c7a9581 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -358,6 +358,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/fs/super.c b/fs/super.c
index 954aeb8..86822b8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -206,6 +206,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	mutex_init(&s->s_sync_lock);
 	INIT_LIST_HEAD(&s->s_inodes);
 	spin_lock_init(&s->s_inode_list_lock);
+	INIT_LIST_HEAD(&s->s_inodes_wb);
+	spin_lock_init(&s->s_inode_wblist_lock);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ef3cd36..aef01c7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -648,6 +648,7 @@ struct inode {
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1374,6 +1375,9 @@ struct super_block {
 	/* s_inode_list_lock protects s_inodes */
 	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
 	struct list_head	s_inodes;	/* all inodes */
+
+	spinlock_t		s_inode_wblist_lock;
+	struct list_head	s_inodes_wb;	/* writeback inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b333c94..90a380c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -379,4 +379,7 @@ void tag_pages_for_writeback(struct address_space *mapping,
 
 void account_page_redirty(struct page *page);
 
+void sb_mark_inode_writeback(struct inode *inode);
+void sb_clear_inode_writeback(struct inode *inode);
+
 #endif		/* WRITEBACK_H */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d15d88c..dcd4e2b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2735,6 +2735,11 @@ int test_clear_page_writeback(struct page *page)
 				__wb_writeout_inc(wb);
 			}
 		}
+
+		if (mapping->host && !mapping_tagged(mapping,
+						     PAGECACHE_TAG_WRITEBACK))
+			sb_clear_inode_writeback(mapping->host);
+
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -2763,11 +2768,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			bool on_wblist;
+
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
+
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
+
+			/*
+			 * we can come through here when swapping anonymous
+			 * pages, so we don't necessarily have an inode to
+			 * track for sync. Inodes are removed lazily, so there is
+			 * no equivalent in test_clear_page_writeback().
+			 */
+			if (!on_wblist && mapping->host)
+				sb_mark_inode_writeback(mapping->host);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
-- 
2.4.3


* [RFC PATCH v5 2/2] wb: inode writeback list tracking tracepoints
From: Brian Foster @ 2016-01-12 18:35 UTC
  To: linux-fsdevel; +Cc: dchinner, jbacik, jack

The per-sb inode writeback list tracks inodes currently under writeback
to facilitate efficient sync processing. In particular, it ensures that
sync only needs to walk through a list of inodes affected by the sync.

Add a couple of tracepoints to help identify when inodes are added to
and removed from the writeback list. Piggyback on the writeback
lazytime tracepoint template as it already tracks the relevant inode
information.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/fs-writeback.c                | 17 +++++++++++++----
 include/trace/events/writeback.h | 22 ++++++++++++++++++----
 2 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6821347..672664b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -954,8 +954,10 @@ void sb_mark_inode_writeback(struct inode *inode)
 
 	if (list_empty(&inode->i_wb_list)) {
 		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
-		if (list_empty(&inode->i_wb_list))
+		if (list_empty(&inode->i_wb_list)) {
 			list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
+			trace_sb_mark_inode_writeback(inode);
+		}
 		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
 	}
 }
@@ -970,7 +972,10 @@ void sb_clear_inode_writeback(struct inode *inode)
 
 	if (!list_empty(&inode->i_wb_list)) {
 		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
-		list_del_init(&inode->i_wb_list);
+		if (!list_empty(&inode->i_wb_list)) {
+			list_del_init(&inode->i_wb_list);
+			trace_sb_clear_inode_writeback(inode);
+		}
 		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
 	}
 }
@@ -2202,8 +2207,10 @@ static void wait_sb_inodes(struct super_block *sb)
 		 * The mapping can be cleaned before we wait on the inode since
 		 * we do not have the mapping lock.
 		 */
-		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			trace_sb_clear_inode_writeback(inode);
 			continue;
+		}
 
 		spin_unlock_irq(&sb->s_inode_wblist_lock);
 
@@ -2239,8 +2246,10 @@ static void wait_sb_inodes(struct super_block *sb)
 			if (list_empty(&inode->i_wb_list))
 				list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
 			spin_unlock(&sb->s_inode_wblist_lock);
-		} else
+		} else {
 			WARN_ON(!list_empty(&inode->i_wb_list));
+			trace_sb_clear_inode_writeback(inode);
+		}
 		spin_unlock_irq(&mapping->tree_lock);
 
 		iput(inode);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index fff846b..962a7f1 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -727,7 +727,7 @@ DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
 	TP_ARGS(inode, wbc, nr_to_write)
 );
 
-DECLARE_EVENT_CLASS(writeback_lazytime_template,
+DECLARE_EVENT_CLASS(writeback_inode_template,
 	TP_PROTO(struct inode *inode),
 
 	TP_ARGS(inode),
@@ -754,25 +754,39 @@ DECLARE_EVENT_CLASS(writeback_lazytime_template,
 		  show_inode_state(__entry->state), __entry->mode)
 );
 
-DEFINE_EVENT(writeback_lazytime_template, writeback_lazytime,
+DEFINE_EVENT(writeback_inode_template, writeback_lazytime,
 	TP_PROTO(struct inode *inode),
 
 	TP_ARGS(inode)
 );
 
-DEFINE_EVENT(writeback_lazytime_template, writeback_lazytime_iput,
+DEFINE_EVENT(writeback_inode_template, writeback_lazytime_iput,
 	TP_PROTO(struct inode *inode),
 
 	TP_ARGS(inode)
 );
 
-DEFINE_EVENT(writeback_lazytime_template, writeback_dirty_inode_enqueue,
+DEFINE_EVENT(writeback_inode_template, writeback_dirty_inode_enqueue,
 
 	TP_PROTO(struct inode *inode),
 
 	TP_ARGS(inode)
 );
 
+/*
+ * Inode writeback list tracking.
+ */
+
+DEFINE_EVENT(writeback_inode_template, sb_mark_inode_writeback,
+	TP_PROTO(struct inode *inode),
+	TP_ARGS(inode)
+);
+
+DEFINE_EVENT(writeback_inode_template, sb_clear_inode_writeback,
+	TP_PROTO(struct inode *inode),
+	TP_ARGS(inode)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
-- 
2.4.3


* Re: [RFC PATCH v5 1/2] sb: add a new writeback list for sync
From: Brian Foster @ 2016-01-12 19:24 UTC
  To: linux-fsdevel; +Cc: dchinner, jbacik, jack

On Tue, Jan 12, 2016 at 01:35:38PM -0500, Brian Foster wrote:
> From: Dave Chinner <dchinner@redhat.com>
...
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 023f6a1..6821347 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
...
> @@ -1118,13 +1149,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
>  }
>  
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> +	 * the inode and then deref bdev->bd_disk, which at this point has been
> +	 * set to NULL, so we would panic.  At the point we are dropping our
> +	 * bd_inode we won't have any pages under writeback on the device so
> +	 * this is safe.  But just in case we'll assert to make sure we don't
> +	 * screw this up.
> +	 */

While the comment above is no longer relevant now that this doesn't
touch the bdi, I'm not totally sure whether the following few lines are
still necessary. I think this was associated with the lazy list removal
(to remove the inode on eviction), but I'd have to dig into that a bit
more to be sure...

Brian

> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		sb_clear_inode_writeback(inode);
> +	BUG_ON(!list_empty(&inode->i_wb_list));
>  }
...

* Re: [RFC PATCH v5 1/2] sb: add a new writeback list for sync
From: Jan Kara @ 2016-01-13 10:43 UTC
  To: Brian Foster; +Cc: linux-fsdevel, dchinner, jbacik, jack

On Tue 12-01-16 13:35:38, Brian Foster wrote:
> From: Dave Chinner <dchinner@redhat.com>
...

>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> +	 * the inode and then deref bdev->bd_disk, which at this point has been
> +	 * set to NULL, so we would panic.  At the point we are dropping our
> +	 * bd_inode we won't have any pages under writeback on the device so
> +	 * this is safe.  But just in case we'll assert to make sure we don't
> +	 * screw this up.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))

The condition and the comment can go away now. We no longer need to call
inode_to_bdi()...

> +		sb_clear_inode_writeback(inode);
> +	BUG_ON(!list_empty(&inode->i_wb_list));
>  }
>  
>  /*
> @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	LIST_HEAD(sync_list);
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> +	/*
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out.  In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> +	 * acquired inside of the mapping lock by __test_set_page_writeback().
> +	 * We cannot acquire i_lock while the wblist lock is held without
> +	 * introducing irq inversion issues. Since s_inodes_wb is a subset of
> +	 * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> +	 * until we have a reference. Note that s_inode_wblist_lock protects the
> +	 * local sync_list as well because inodes can be dropped from either
> +	 * list by writeback completion.
> +	 */
>  	mutex_lock(&sb->s_sync_lock);
> +
>  	spin_lock(&sb->s_inode_list_lock);

So I'm not very happy that we have to hold s_inode_list_lock here. After
all, reducing contention on that lock was one of the main motivations of
this patch. That being said, I think the locking is correct now, which is
a good start :) and we can work from here.

As a future improvement I think we could use RCU to grab an inode
reference safely, like:

Hold rcu_read_lock() instead of s_inode_list_lock. That guarantees that
memory for the inode we have a reference to is not freed. So we can then:

	spin_unlock_irq(&sb->s_inode_wblist_lock);
	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW))
		avoid touching inode
	__iget(inode);
	spin_unlock(&inode->i_lock);
	rcu_read_unlock();

But I'd do this as a separate patch and run this through Al to verify...
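
Fleshed out a bit, the per-inode section of the walk might look like the
following (a sketch only, relying on inodes already being freed via RCU;
the details would need Al's confirmation):

	rcu_read_lock();
	spin_unlock_irq(&sb->s_inode_wblist_lock);

	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
		/* inode is being torn down, don't touch it */
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		spin_lock_irq(&sb->s_inode_wblist_lock);
		continue;
	}
	__iget(inode);
	spin_unlock(&inode->i_lock);
	rcu_read_unlock();

	/* ... wait via filemap_fdatawait_keep_errors(), then iput() ... */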

> +	spin_lock_irq(&sb->s_inode_wblist_lock);
> +	list_splice_init(&sb->s_inodes_wb, &sync_list);
>  
> -	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> -	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> +		list_del_init(&inode->i_wb_list);

What if we just did:

		list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);

here instead of list_del_init()? That way we would always maintain
consistency:

 PAGECACHE_TAG_WRITEBACK set iff !list_empty(&inode->i_wb_list)

under s_inode_wblist_lock and thus you could just delete that list
shuffling below...
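
That is, the head of the walk would become something like the following
(a sketch, using list_move_tail() so the inode never drops off both
lists; writeback completion remains responsible for the final removal):

	while (!list_empty(&sync_list)) {
		struct inode *inode = list_first_entry(&sync_list, struct inode,
						       i_wb_list);
		struct address_space *mapping = inode->i_mapping;

		/*
		 * Move the inode straight back to the primary list so the
		 * "tag set iff on a list" invariant holds. If the tag has
		 * already been cleared, a pending writeback completion
		 * will list_del_init() it from wherever it sits.
		 */
		list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);

		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
			continue;

		/* ... pin the inode and wait as before ... */
	}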

> +		/*
> +		 * The inode has been written back. Check whether we still have
> +		 * pages under I/O and move the inode back to the primary list
> +		 * if necessary. sb_mark_inode_writeback() might not have done
> +		 * anything if the writeback tag hadn't been cleared from the
> +		 * mapping by the time more wb had started.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			spin_lock(&sb->s_inode_wblist_lock);
> +			if (list_empty(&inode->i_wb_list))
> +				list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> +			spin_unlock(&sb->s_inode_wblist_lock);
> +		} else
> +			WARN_ON(!list_empty(&inode->i_wb_list));
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		iput(inode);
> +
>  		spin_lock(&sb->s_inode_list_lock);
> +		spin_lock_irq(&sb->s_inode_wblist_lock);
>  	}
> +
> +	spin_unlock_irq(&sb->s_inode_wblist_lock);
>  	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> +
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>  

...

> +
> +			/*
> +			 * we can come through here when swapping anonymous
> +			 * pages, so we don't necessarily have an inode to
> +			 * track for sync. Inodes are remove lazily, so there is
> +			 * no equivalent in test_clear_page_writeback().

The comment about lazy removal is not true anymore...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [RFC PATCH v5 1/2] sb: add a new writeback list for sync
From: Brian Foster @ 2016-01-13 14:33 UTC
  To: Jan Kara; +Cc: linux-fsdevel, dchinner, jbacik

On Wed, Jan 13, 2016 at 11:43:02AM +0100, Jan Kara wrote:
> On Tue 12-01-16 13:35:38, Brian Foster wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> ...
> 
> >  void inode_wait_for_writeback(struct inode *inode)
> >  {
> > +	BUG_ON(!(inode->i_state & I_FREEING));
> > +
> >  	spin_lock(&inode->i_lock);
> >  	__inode_wait_for_writeback(inode);
> >  	spin_unlock(&inode->i_lock);
> > +
> > +	/*
> > +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> > +	 * the inode and then deref bdev->bd_disk, which at this point has been
> > +	 * set to NULL, so we would panic.  At the point we are dropping our
> > +	 * bd_inode we won't have any pages under writeback on the device so
> > +	 * this is safe.  But just in case we'll assert to make sure we don't
> > +	 * screw this up.
> > +	 */
> > +	if (!sb_is_blkdev_sb(inode->i_sb))
> 
> The condition and the comment can go away now. We no longer need to call
> inode_to_bdi()...
> 

Yeah, I noted that in my follow-on mail. I think this whole hunk along
with the change to kill_bdev() can both actually get dropped.
kill_bdev() does truncate_inode_pages(), which eventually calls
wait_on_page_writeback() for every page in the mapping. This function
clearly already waits for inode writeback.

Unless I'm misunderstanding something, the original changes in both
places look like they were just there to ensure that the inode is
removed from the wb list when wb completion didn't do so. That is no
longer necessary now that wb completion takes care of that. I'll do some
more testing of course, but any objections if I drop these hunks?

Alternatively, we could retain only the BUG_ON() bits here to make sure
everything works as expected. Thoughts?
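
With both hunks dropped (keeping only the BUG_ON() bits), the function
would collapse to something like this (a sketch of the proposal, not
yet tested):

	/*
	 * Writeback completion already removes the inode from the wb list
	 * when the mapping's writeback tag clears, so eviction only needs
	 * to wait for writeback and assert that the list linkage is gone.
	 */
	void inode_wait_for_writeback(struct inode *inode)
	{
		BUG_ON(!(inode->i_state & I_FREEING));

		spin_lock(&inode->i_lock);
		__inode_wait_for_writeback(inode);
		spin_unlock(&inode->i_lock);

		BUG_ON(!list_empty(&inode->i_wb_list));
	}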

> > +		sb_clear_inode_writeback(inode);
> > +	BUG_ON(!list_empty(&inode->i_wb_list));
> >  }
> >  
> >  /*
> > @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> >   */
> >  static void wait_sb_inodes(struct super_block *sb)
> >  {
> > -	struct inode *inode, *old_inode = NULL;
> > +	LIST_HEAD(sync_list);
> >  
> >  	/*
> >  	 * We need to be protected against the filesystem going from
> > @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
> >  	 */
> >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >  
> > +	/*
> > +	 * Data integrity sync. Must wait for all pages under writeback, because
> > +	 * there may have been pages dirtied before our sync call, but which had
> > +	 * writeout started before we write it out.  In which case, the inode
> > +	 * may not be on the dirty list, but we still have to wait for that
> > +	 * writeout.
> > +	 *
> > +	 * To avoid syncing inodes put under IO after we have started here,
> > +	 * splice the io list to a temporary list head and walk that. Newly
> > +	 * dirtied inodes will go onto the primary list so we won't wait for
> > +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> > +	 * so we'll complete processing the complete list before the next
> > +	 * sync operation repeats the splice-and-walk process.
> > +	 *
> > +	 * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> > +	 * acquired inside of the mapping lock by __test_set_page_writeback().
> > +	 * We cannot acquire i_lock while the wblist lock is held without
> > +	 * introducing irq inversion issues. Since s_inodes_wb is a subset of
> > +	 * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> > +	 * until we have a reference. Note that s_inode_wblist_lock protects the
> > +	 * local sync_list as well because inodes can be dropped from either
> > +	 * list by writeback completion.
> > +	 */
> >  	mutex_lock(&sb->s_sync_lock);
> > +
> >  	spin_lock(&sb->s_inode_list_lock);
> 
> > So I'm not very happy that we have to hold s_inode_list_lock here. After
> > all, reducing contention on that lock was one of the main motivations of
> > this patch. That being said, I think the locking is correct now, which is
> > a good start :) and we can work from here.
> 

Thanks... I was shooting for correct and as simple as possible. ;) FWIW,
the primary motivation that I'm aware of for this patch is the current
behavior of unnecessarily walking a massive inode list when the inode
cache is very large (more so than lock contention). I suspect lock
contention in this path would more likely involve the wblist_lock at
this point, since it protects the wblist.

> As a future improvement I think we could use RCU to grab an inode
> reference safely, like:
> 
> Hold rcu_read_lock() instead of s_inode_list_lock. That guarantees that
> memory for the inode we have a reference to is not freed. So we can then:
> 
> 	spin_unlock_irq(&sb->s_inode_wblist_lock);
> 	spin_lock(&inode->i_lock);
> 	if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW))
> 		avoid touching inode
> 	__iget(inode);
> 	spin_unlock(&inode->i_lock);
> 	rcu_read_unlock();
> 
> But I'd do this as a separate patch and run this through Al to verify...
> 

Ok, I can definitely swap s_inode_list_lock with the rcu lock as a
follow-on patch. That in fact might be the more correct thing to do now
that the s_inodes list is no longer involved.

> > +	spin_lock_irq(&sb->s_inode_wblist_lock);
> > +	list_splice_init(&sb->s_inodes_wb, &sync_list);
> >  
> > -	/*
> > -	 * Data integrity sync. Must wait for all pages under writeback,
> > -	 * because there may have been pages dirtied before our sync
> > -	 * call, but which had writeout started before we write it out.
> > -	 * In which case, the inode may not be on the dirty list, but
> > -	 * we still have to wait for that writeout.
> > -	 */
> > -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > +	while (!list_empty(&sync_list)) {
> > +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> > +						       i_wb_list);
> >  		struct address_space *mapping = inode->i_mapping;
> >  
> > +		list_del_init(&inode->i_wb_list);
> 
> What if we just did:
> 
> 		list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> 
> here instead of list_del_init()? That way we would always maintain
> consistency:
> 
>  PAGECACHE_TAG_WRITEBACK set iff !list_empty(&inode->i_wb_list)
> 
> under s_inode_wblist_lock and thus you could just delete that list
> shuffling below...
> 

I assume you mean list_move_tail(), but yeah I think you're right. We
can still use the mapping tag check optimization without removing the
inode, since there's no risk of leaving it on a local list after the
function returns. Beyond that, once the lock is dropped, the list state
simply depends on whether more wb starts after the immediately
preceding wait. I'll stare at it some more, but it sounds like a nice
cleanup to me.

> > +		/*
> > +		 * The inode has been written back. Check whether we still have
> > +		 * pages under I/O and move the inode back to the primary list
> > +		 * if necessary. sb_mark_inode_writeback() might not have done
> > +		 * anything if the writeback tag hadn't been cleared from the
> > +		 * mapping by the time more wb had started.
> > +		 */
> > +		spin_lock_irq(&mapping->tree_lock);
> > +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > +			spin_lock(&sb->s_inode_wblist_lock);
> > +			if (list_empty(&inode->i_wb_list))
> > +				list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> > +			spin_unlock(&sb->s_inode_wblist_lock);
> > +		} else
> > +			WARN_ON(!list_empty(&inode->i_wb_list));
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +
> > +		iput(inode);
> > +
> >  		spin_lock(&sb->s_inode_list_lock);
> > +		spin_lock_irq(&sb->s_inode_wblist_lock);
> >  	}
> > +
> > +	spin_unlock_irq(&sb->s_inode_wblist_lock);
> >  	spin_unlock(&sb->s_inode_list_lock);
> > -	iput(old_inode);
> > +
> >  	mutex_unlock(&sb->s_sync_lock);
> >  }
> >  
> 
> ...
> 
> > +
> > +			/*
> > +			 * we can come through here when swapping anonymous
> > +			 * pages, so we don't necessarily have an inode to
> > +			 * track for sync. Inodes are remove lazily, so there is
> > +			 * no equivalent in test_clear_page_writeback().
> 
> The comment about lazy removal is not true anymore...
> 

Yep, will fix. Thanks!

Brian

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

* Re: [RFC PATCH v5 1/2] sb: add a new writeback list for sync
From: Jan Kara @ 2016-01-15 12:16 UTC
  To: Brian Foster; +Cc: Jan Kara, linux-fsdevel, dchinner, jbacik

On Wed 13-01-16 09:33:05, Brian Foster wrote:
> On Wed, Jan 13, 2016 at 11:43:02AM +0100, Jan Kara wrote:
> > On Tue 12-01-16 13:35:38, Brian Foster wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > ...
> > 
> > >  void inode_wait_for_writeback(struct inode *inode)
> > >  {
> > > +	BUG_ON(!(inode->i_state & I_FREEING));
> > > +
> > >  	spin_lock(&inode->i_lock);
> > >  	__inode_wait_for_writeback(inode);
> > >  	spin_unlock(&inode->i_lock);
> > > +
> > > +	/*
> > > +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> > > +	 * the inode and then deref bdev->bd_disk, which at this point has been
> > > +	 * set to NULL, so we would panic.  At the point we are dropping our
> > > +	 * bd_inode we won't have any pages under writeback on the device so
> > > +	 * this is safe.  But just in case we'll assert to make sure we don't
> > > +	 * screw this up.
> > > +	 */
> > > +	if (!sb_is_blkdev_sb(inode->i_sb))
> > 
> > The condition and the comment can go away now. We no longer need to call
> > inode_to_bdi()...
> > 
> 
> Yeah, I noted that in my follow-on mail. I think this whole hunk along
> with the change to kill_bdev() can both actually get dropped.
> kill_bdev() does truncate_inode_pages(), which eventually calls
> wait_on_page_writeback() for every page in the mapping. This function
> clearly already waits for inode writeback.
> 
> Unless I'm misunderstanding something, the original changes in both
> places look like they were just there to ensure that the inode is
> removed from the wb list when wb completion didn't do so. That is no
> longer necessary now that wb completion takes care of that. I'll do some
> more testing of course, but any objections if I drop these hunks?

I agree with removal.

> Alternatively, we could retain only the BUG_ON() bits here to make sure
> everything works as expected. Thoughts?

Just BUG_ON(!list_empty(&inode->i_wb_list)) in clear_inode() should be
enough...
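
That is, something like the following (a sketch against fs/inode.c of
this vintage; the surrounding checks already exist there):

	void clear_inode(struct inode *inode)
	{
		/* ... existing nrpages/nrshadows/private_list checks ... */
		BUG_ON(!(inode->i_state & I_FREEING));
		BUG_ON(inode->i_state & I_CLEAR);
		BUG_ON(!list_empty(&inode->i_wb_list));	/* new assertion */
		/* don't need i_lock here, no concurrent mods to i_state */
		inode->i_state = I_FREEING | I_CLEAR;
	}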

> > > +		sb_clear_inode_writeback(inode);
> > > +	BUG_ON(!list_empty(&inode->i_wb_list));
> > >  }
> > >  
> > >  /*
> > > @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> > >   */
> > >  static void wait_sb_inodes(struct super_block *sb)
> > >  {
> > > -	struct inode *inode, *old_inode = NULL;
> > > +	LIST_HEAD(sync_list);
> > >  
> > >  	/*
> > >  	 * We need to be protected against the filesystem going from
> > > @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
> > >  	 */
> > >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > >  
...
> > >  	mutex_lock(&sb->s_sync_lock);
> > > +
> > >  	spin_lock(&sb->s_inode_list_lock);
> > 
> > So I'm not very happy that we have to hold s_inode_list_lock here. After
> > all, reducing contention on that lock was one of the main motivations of
> > this patch. That being said, I think the locking is correct now, which is
> > a good start :) and we can work from here.
> > 
> 
> Thanks... I was shooting for correct and as simple as possible. ;) FWIW,
> the primary motivation that I'm aware of for this patch is the current
> behavior of unnecessarily walking a massive inode list when the inode
> cache is very large (more so than lock contention). I suspect lock
> contention in this path would more likely involve the wblist_lock at
> this point, since it protects the wblist.

Well, it likely depends on the workload. Under some workloads the
contention on s_inode_list_lock is heavy because:

a) sync uses it heavily
b) inode reclaim needs it
c) loading of new inodes into memory needs it

Now your patch shortens the lock hold times for sync, which should help,
but not having to touch it at all would be even better :).

> > > +	spin_lock_irq(&sb->s_inode_wblist_lock);
> > > +	list_splice_init(&sb->s_inodes_wb, &sync_list);
> > >  
> > > -	/*
> > > -	 * Data integrity sync. Must wait for all pages under writeback,
> > > -	 * because there may have been pages dirtied before our sync
> > > -	 * call, but which had writeout started before we write it out.
> > > -	 * In which case, the inode may not be on the dirty list, but
> > > -	 * we still have to wait for that writeout.
> > > -	 */
> > > -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > > +	while (!list_empty(&sync_list)) {
> > > +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> > > +						       i_wb_list);
> > >  		struct address_space *mapping = inode->i_mapping;
> > >  
> > > +		list_del_init(&inode->i_wb_list);
> > 
> > What if we just did:
> > 
> > 		list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> > 
> > here instead of list_del_init()? That way we would always maintain
> > consistency:
> > 
> >  PAGECACHE_TAG_WRITEBACK set iff !list_empty(&inode->i_wb_list)
> > 
> > under s_inode_wblist_lock and thus you could just delete that list
> > shuffling below...
> > 
> 
> I assume you mean list_move_tail(), but yeah I think you're right. We

Yes, I meant list_move_tail().

> can still use the mapping tag check optimization without removing the
> inode, since there's no risk of leaving it on a local list after the
> function returns. Beyond that, once the lock is dropped, the list state
> simply depends on whether more wb starts after the immediately
> preceding wait. I'll stare at it some more, but it sounds like a nice
> cleanup to me.

Exactly.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
