* [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5)
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel,
	Wu Fengguang, LKML


Andrew,

Here are the intended patches for upstream submission, rebased to v3.0-rc2:

	[PATCH 01/15] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
	[PATCH 02/15] writeback: update dirtied_when for synced inode to prevent livelock
	[PATCH 03/15] writeback: introduce writeback_control.inodes_cleaned
	[PATCH 04/15] writeback: try more writeback as long as something was written
	[PATCH 05/15] writeback: the kupdate expire timestamp should be a moving target
	[PATCH 06/15] writeback: refill b_io iff empty
	[PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
	[PATCH 08/15] writeback: elevate queue_io() into wb_writeback()
	[PATCH 09/15] writeback: avoid extra sync work at enqueue time
	[PATCH 10/15] writeback: add bdi_dirty_limit() kernel-doc
	[PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs
	[PATCH 12/15] writeback: remove writeback_control.more_io
	[PATCH 13/15] writeback: remove .nonblocking and .encountered_congestion
	[PATCH 14/15] writeback: trace event writeback_single_inode
	[PATCH 15/15] writeback: trace event writeback_queue_io

 fs/block_dev.c                   |   16 ++-
 fs/ext4/inode.c                  |    4 +-
 fs/fs-writeback.c                |  219 ++++++++++++++++++++------------------
 fs/inode.c                       |    5 +-
 fs/nfs/write.c                   |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c      |    2 +-
 include/linux/backing-dev.h      |    2 +
 include/linux/writeback.h        |   10 +--
 include/trace/events/btrfs.h     |    6 +-
 include/trace/events/ext4.h      |    6 +-
 include/trace/events/writeback.h |  100 +++++++++++++++++-
 mm/backing-dev.c                 |   21 +++-
 mm/filemap.c                     |    6 +-
 mm/page-writeback.c              |   25 +++--
 mm/rmap.c                        |    4 +-
 15 files changed, 270 insertions(+), 159 deletions(-)

They are git pullable from

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback

Thanks to Jan and Dave for the careful reviews!

Thanks,
Fengguang




* [PATCH 01/15] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-for-sync.patch --]
[-- Type: text/plain, Size: 4595 bytes --]

sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.

Note that writeback_inodes_sb() is not only called by sync(); its other
callers are treated the same way, because they need livelock prevention
too.

Impact:  it changes the order in which pages/inodes are synced to disk.
In the WB_SYNC_NONE stage, writeback now won't proceed to the next
inode until it has finished with the current one.
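
For reference, a condensed sketch of the sync(2) path around v3.0
(fs/sync.c) showing the two stages referred to above; quota writeback,
->sync_fs() and block device syncing are elided:

	SYSCALL_DEFINE0(sync)
	{
		wakeup_flusher_threads(0);
		sync_filesystems(0);	/* stage 1: WB_SYNC_NONE, now tagged */
		sync_filesystems(1);	/* stage 2: WB_SYNC_ALL */
		if (unlikely(laptop_mode))
			laptop_sync_completion();
		return 0;
	}

	static int __sync_filesystem(struct super_block *sb, int wait)
	{
		if (wait)
			sync_inodes_sb(sb);
		else
			writeback_inodes_sb(sb);	/* gets .tagged_writepages */
		/* ... quota / ->sync_fs() / __sync_blockdev() elided ... */
		return 0;
	}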

Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/ext4/inode.c           |    4 ++--
 fs/fs-writeback.c         |   17 +++++++++--------
 include/linux/writeback.h |    1 +
 mm/page-writeback.c       |    4 ++--
 4 files changed, 14 insertions(+), 12 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-06-08 05:20:05.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-08 05:20:06.000000000 +0800
@@ -36,6 +36,7 @@ struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	enum writeback_sync_modes sync_mode;
+	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
@@ -650,6 +651,7 @@ static long wb_writeback(struct bdi_writ
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
 		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
@@ -657,7 +659,7 @@ static long wb_writeback(struct bdi_writ
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
-	long write_chunk;
+	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 
 	if (wbc.for_kupdate) {
@@ -683,9 +685,7 @@ static long wb_writeback(struct bdi_writ
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
 	 */
-	if (wbc.sync_mode == WB_SYNC_NONE)
-		write_chunk = MAX_WRITEBACK_PAGES;
-	else
+	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
@@ -1188,10 +1188,11 @@ void writeback_inodes_sb_nr(struct super
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
-		.sb		= sb,
-		.sync_mode	= WB_SYNC_NONE,
-		.done		= &done,
-		.nr_pages	= nr,
+		.sb			= sb,
+		.sync_mode		= WB_SYNC_NONE,
+		.tagged_writepages	= 1,
+		.done			= &done,
+		.nr_pages		= nr,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
--- linux-next.orig/include/linux/writeback.h	2011-06-08 05:20:05.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-08 05:20:06.000000000 +0800
@@ -47,6 +47,7 @@ struct writeback_control {
 	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
+	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned more_io:1;		/* more io to be dispatched */
--- linux-next.orig/mm/page-writeback.c	2011-06-08 05:20:05.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-08 05:20:06.000000000 +0800
@@ -892,12 +892,12 @@ int write_cache_pages(struct address_spa
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
--- linux-next.orig/fs/ext4/inode.c	2011-06-08 05:20:05.000000000 +0800
+++ linux-next/fs/ext4/inode.c	2011-06-08 05:20:06.000000000 +0800
@@ -2741,7 +2741,7 @@ static int write_cache_pages_da(struct a
 	index = wbc->range_start >> PAGE_CACHE_SHIFT;
 	end = wbc->range_end >> PAGE_CACHE_SHIFT;
 
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
@@ -2973,7 +2973,7 @@ static int ext4_da_writepages(struct add
 	}
 
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 
 	while (!ret && wbc->nr_to_write > 0) {




* [PATCH 02/15] writeback: update dirtied_when for synced inode to prevent livelock
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-fix-sync-busyloop.patch --]
[-- Type: text/plain, Size: 3807 bytes --]

Explicitly update .dirtied_when on synced inodes, so that they are no
longer considered for writeback in the next round.

We'll do more aggressive "keep writeback as long as we wrote something"
logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
no longer be enough to stop sync livelock.

This prevents both of the following livelock scenarios:

- while true; do echo data >> f; done
- while true; do touch f;        done
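
The prevention builds on the existing livelock check in
writeback_sb_inodes(), condensed below (it is visible in patch 09's
diff, where it is later replaced by .older_than_this): an inode whose
dirtied_when is newer than the sync start time is skipped, so
re-stamping a just-synced but redirtied inode takes it out of the
current sync's work set.

	/*
	 * Was this inode dirtied after sync_sb_inodes was called?
	 * This keeps sync from extra jobs and livelock.
	 */
	if (inode_dirtied_after(inode, wbc->wb_start)) {
		spin_unlock(&inode->i_lock);
		return 1;
	}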

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    9 +++++++++
 1 file changed, 9 insertions(+)

ext3/ext4 are working fine now; however, tests show that XFS may still
livelock inside its own routines:

[ 3581.181253] sync            D ffff8800b6ca15d8  4560  4403   4392 0x00000000
[ 3581.181734]  ffff88006f775bc8 0000000000000046 ffff8800b6ca12b8 00000001b6ca1938
[ 3581.182411]  ffff88006f774000 00000000001d2e40 00000000001d2e40 ffff8800b6ca1280
[ 3581.183088]  00000000001d2e40 ffff88006f775fd8 00000340af111ef2 00000000001d2e40
[ 3581.183765] Call Trace:
[ 3581.184008]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.184392]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.184756]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.185120]  [<ffffffff812ed520>] xfs_ioend_wait+0x87/0x9f
[ 3581.185474]  [<ffffffff8108c97a>] ? wake_up_bit+0x2a/0x2a
[ 3581.185827]  [<ffffffff812f742a>] xfs_sync_inode_data+0x92/0x9d
[ 3581.186198]  [<ffffffff812f76e2>] xfs_inode_ag_walk+0x1a5/0x287
[ 3581.186569]  [<ffffffff812f779b>] ? xfs_inode_ag_walk+0x25e/0x287
[ 3581.186946]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.187311]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.187669]  [<ffffffff81092175>] ? local_clock+0x41/0x5a
[ 3581.188020]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.188403]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.188773]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.189130]  [<ffffffff812e236c>] ? xfs_perag_get+0x80/0xd0
[ 3581.189488]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.189858]  [<ffffffff812f7831>] ? xfs_inode_ag_iterator+0x6d/0x8f
[ 3581.190241]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.190606]  [<ffffffff812f780b>] xfs_inode_ag_iterator+0x47/0x8f
[ 3581.190982]  [<ffffffff811611f5>] ? __sync_filesystem+0x7a/0x7a
[ 3581.191352]  [<ffffffff812f7877>] xfs_sync_data+0x24/0x43
[ 3581.191703]  [<ffffffff812f7911>] xfs_quiesce_data+0x2c/0x88
[ 3581.192065]  [<ffffffff812f5556>] xfs_fs_sync_fs+0x21/0x48
[ 3581.192419]  [<ffffffff811611e1>] __sync_filesystem+0x66/0x7a
[ 3581.192783]  [<ffffffff8116120b>] sync_one_sb+0x16/0x18
[ 3581.193128]  [<ffffffff8113e3e3>] iterate_supers+0x72/0xce
[ 3581.193482]  [<ffffffff81161140>] sync_filesystems+0x20/0x22
[ 3581.193842]  [<ffffffff8116127e>] sys_sync+0x21/0x33
[ 3581.194177]  [<ffffffff819016c2>] system_call_fastpath+0x16/0x1b

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:14.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
@@ -419,6 +419,15 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
+		/*
+		 * Sync livelock prevention. Each inode is tagged and synced in
+		 * one shot. If still dirty, it will be redirty_tail()'ed below.
+		 * Update the dirty time to prevent enqueue and sync it again.
+		 */
+		if ((inode->i_state & I_DIRTY) &&
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+			inode->dirtied_when = jiffies;
+
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()




* [PATCH 03/15] writeback: introduce writeback_control.inodes_cleaned
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-inodes_written.patch --]
[-- Type: text/plain, Size: 2319 bytes --]

The flusher works on dirty inodes in batches, and may quit prematurely
if a batch of inodes happens to be dirtied in its metadata only: in
that case wbc->nr_to_write is not decreased at all, which correctly
stands for "no pages written", but is also misinterpreted as "no
progress".

So introduce writeback_control.inodes_cleaned to count the inodes that
get cleaned.  A non-zero value means some progress was made on
writeback, in which case more writeback can be tried.
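
Condensed from the diff below, the counter is bumped where a clean
inode leaves the writeback lists and consumed in the wb_writeback()
retry loop:

	/* writeback_single_inode(): the inode came out clean */
	list_del_init(&inode->i_wb_list);
	wbc->inodes_cleaned++;

	/* wb_writeback(): cleaned inodes count as progress even though
	 * wbc.nr_to_write was never decremented */
	if (wbc.inodes_cleaned)
		continue;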

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    4 ++++
 include/linux/writeback.h |    1 +
 2 files changed, 5 insertions(+)

About v1: the initial version counted successful ->write_inode()
calls.  However, that leads to busy loops for sync() over NFS, because
NFS ridiculously returns 0 (success) while at the same time redirtying
the inode.  The NFS case can be trivially fixed, however there may be
more hidden bugs in other filesystems...

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
@@ -464,6 +464,7 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
+			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -725,6 +726,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
+		wbc.inodes_cleaned = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -741,6 +743,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write <= 0)
 			continue;
+		if (wbc.inodes_cleaned)
+			continue;
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:14.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:16.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is




* [PATCH 04/15] writeback: try more writeback as long as something was written
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-background-retry.patch --]
[-- Type: text/plain, Size: 2708 bytes --]

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in
that they may populate only a subset of the eligible inodes into b_io
at entrance time. When the queued inodes have all been synced, they
simply return, possibly with all queued inode pages written but
wbc.nr_to_write still > 0.

For kupdate and background writeback, there may be more eligible
inodes sitting in b_dirty when the current set of b_io inodes has been
completed. So it is necessary to try another round of writeback as
long as some progress was made in this round. Once there are no more
eligible inodes, queue_io() will enqueue nothing, hence nothing will
be synced and we can safely bail.

For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

At queue_io() time, i90-i99 happen to be expired and are moved to b_io
for IO. If they all complete successfully with a total size of less
than MAX_WRITEBACK_PAGES, nr_to_write will still be > 0. Then
wb_writeback() (without this patch) will quit the background work
while it's still over the background threshold. This will be a fairly
normal/frequent case I guess.

Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1).  I actually tried to write 1 page
per 1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
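
With the checks reordered, the tail of the wb_writeback() loop reads
roughly as follows (condensed from the diff below):

	wrote += write_chunk - wbc.nr_to_write;

	if (wbc.nr_to_write < write_chunk)
		continue;	/* wrote some pages: try for more */
	if (wbc.inodes_cleaned)
		continue;	/* cleaned some inodes: try for more */
	if (!wbc.more_io)
		break;		/* no more inodes for IO: bail */
	/* nothing written: wait on an inode in b_more_io, otherwise
	 * we'd just busyloop */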

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:17.000000000 +0800
@@ -739,23 +739,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * Dirty inodes are moved to b_io for writeback in batches.
+		 * The completion of the current batch does not necessarily
+		 * mean the overall work is done. So we keep looping as long
+		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)
 			break;
 		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.




* [PATCH 05/15] writeback: the kupdate expire timestamp should be a moving target
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Mel Gorman, Itaru Kitayama, Wu Fengguang,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-moving-dirty-expired.patch --]
[-- Type: text/plain, Size: 2063 bytes --]

Dynamically compute the dirty expire timestamp at queue_io() time.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may come to contain
  only _freshly_ dirtied pages. Ignoring newly expired dirty inodes
  risks pushing their expired dirty pages to the end of the LRU lists,
  triggering the evil pageout(). Nevertheless this patch merely
  addresses part of the problem.
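
Condensed, the change moves the cutoff computation from work entrance
into the wb_writeback() loop, so the expire timestamp tracks "now" on
every iteration (matching the diff below):

	/* before: computed once when the kupdate work starts */
	oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);

	/* after: recomputed per iteration, so inodes that expire while
	 * the work runs are picked up by the next queue_io() */
	for (;;) {
		if (work->for_kupdate) {
			oldest_jif = jiffies -
				msecs_to_jiffies(dirty_expire_interval * 10);
			wbc.older_than_this = &oldest_jif;
		}
		/* ... write one chunk ... */
	}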

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
@@ -672,11 +672,6 @@ static long wb_writeback(struct bdi_writ
 	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -723,6 +718,12 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		if (work->for_kupdate) {
+			oldest_jif = jiffies -
+				msecs_to_jiffies(dirty_expire_interval * 10);
+			wbc.older_than_this = &oldest_jif;
+		}
+
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;




* [PATCH 06/15] writeback: refill b_io iff empty
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-refill-queue-iff-empty.patch --]
[-- Type: text/plain, Size: 7353 bytes --]

There is no point in carrying different refill policies for for_kupdate
and the other types of work. Use a consistent "refill b_io iff empty"
policy, which guarantees fairness in an easy-to-understand way.

A b_io refill will set up a _fixed_ work set containing all currently
eligible inodes and start a new round of walking through b_io. The
"fixed" work set means that no new inodes will be added to it during
the walk. Only when a complete walk over b_io is done will the inodes
that are eligible at that time be enqueued and the walk started over.

This procedure provides fairness among the inodes because it
guarantees that each inode is synced once and only once in each round,
so no inode can be starved.

This change relies on wb_writeback() keeping retrying as long as some
progress was made cleaning pages and/or inodes. Without that ability,
the old logic for background work had to rely on aggressively queuing
all eligible inodes into b_io every time. But that provided no real
guarantee.
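
A condensed view of one fairness round under the new policy (relying
on the retry logic from patch 04):

	if (list_empty(&wb->b_io))
		queue_io(wb, wbc->older_than_this);	/* fix the work set */

	while (!list_empty(&wb->b_io)) {
		/* sync each queued inode once; inodes with remaining
		 * work go to b_more_io and re-enter b_io only at the
		 * next refill, when the walk starts over */
	}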

The test script below now completes slightly faster:

             2.6.39-rc3	  2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043      252.367
stddev           24.381       12.530

tar elapsed      30.097       28.808
dd  elapsed      13.214       11.782

	#!/bin/zsh

	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

	umount /dev/sda7
	mkfs.xfs -f /dev/sda7
	mount /dev/sda7 /fs

	echo 3 > /proc/sys/vm/drop_caches

	tic=$(cat /proc/uptime|cut -d' ' -f2)

	cd /fs
	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

	wait
	sync
	tac=$(cat /proc/uptime|cut -d' ' -f2)
	echo elapsed: $((tac - tic))

It maintains roughly the same small vs. large file writeout shares, and
offers large files a better chance of being written in nice 4MB chunks.

Analysis from Dave Chinner in great detail:

Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)

	writeback  8 small inodes 800 pages
		   1 large inode 224 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)
	.....

Your new code:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 8s)
	writeback  8 small inodes 800 pages

	b_io empty: (1800 pages written)
		b_more_io (large inode) -> b_io (1l)
		14 newly expired inodes -> b_io (1l, 14s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 14s)
	writeback  10 small inodes 1000 pages
		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
	writeback  5 small inodes 500 pages
	b_io empty: (2548 pages written)
		b_more_io (large inode) -> b_io (1l, 1s(24))
		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
	......

Rough progression of pages written at b_io refill:

Old code:

	total	large file	% of writeback
	1024	224		21.9% (fixed)

New code:
	total	large file	% of writeback
	1800	1024		~55%
	2550	1024		~40%
	3050	1024		~33%
	3500	1024		~29%
	3950	1024		~26%
	4250	1024		~24%
	4500	1024		~22.7%
	4700	1024		~21.7%
	4800	1024		~21.3%
	4800	1024		~21.3%
	(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

after + dyn-expire + ioless:
                &(&wb->list_lock)->rlock:          2291           2304           0.15         204.09        3125.12          35315         970712           0.10         223.84     1113437.05
                ------------------------
                &(&wb->list_lock)->rlock              9          [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock           1614          [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock            459          [<ffffffff8115d351>] writeback_sb_inodes+0x108/0x154
                &(&wb->list_lock)->rlock            137          [<ffffffff8115cdcf>] writeback_single_inode+0x1b4/0x296
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110c367>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock           1160          [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock            435          [<ffffffff8115dcb6>] writeback_inodes_wb+0x33/0x12b

after + dyn-expire:
                &(&wb->list_lock)->rlock:        226820         229719           0.10         194.28      809275.91         327372     1033513685           0.08         476.96  3590929811.61
                ------------------------
                &(&wb->list_lock)->rlock             11          [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock          30559          [<ffffffff8115bb1f>] wb_writeback+0x2fb/0x3c3
                &(&wb->list_lock)->rlock          37339          [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
                &(&wb->list_lock)->rlock          54880          [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b606>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock          55347          [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
                &(&wb->list_lock)->rlock          55338          [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:19.000000000 +0800
@@ -589,7 +589,8 @@ void writeback_inodes_wb(struct bdi_writ
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
@@ -616,7 +617,7 @@ static void __writeback_inodes_sb(struct
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_wb_list_lock);




* [PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Wu Fengguang,
	Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-split-lock.patch --]
[-- Type: text/plain, Size: 21387 bytes --]

From: Christoph Hellwig <hch@infradead.org>

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads.  It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
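
Two details in the resulting locking (condensed from the diff below):
bdi_lock_two() always takes the lower-addressed lock first, which
gives a global lock order and rules out ABBA deadlock when two CPUs
switch inodes between the same pair of bdis in opposite directions;
and, per Hugh's fix, bdev_inode_switch_bdi() bails out early when the
source and destination bdi are the same, since locking the same pair
would otherwise recurse:

	if (unlikely(dst == old))	/* deadlock avoidance */
		return;
	bdi_lock_two(&old->wb, &dst->wb);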

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/block_dev.c              |   16 +++--
 fs/fs-writeback.c           |   97 +++++++++++++++++-----------------
 fs/inode.c                  |    5 -
 include/linux/backing-dev.h |    2 
 include/linux/writeback.h   |    2 
 mm/backing-dev.c            |   21 +++++--
 mm/filemap.c                |    6 +-
 mm/rmap.c                   |    4 -
 8 files changed, 85 insertions(+), 68 deletions(-)

--- linux-next.orig/fs/block_dev.c	2011-06-08 05:13:44.000000000 +0800
+++ linux-next/fs/block_dev.c	2011-06-08 05:14:32.000000000 +0800
@@ -44,24 +44,28 @@ inline struct block_device *I_BDEV(struc
 {
 	return &BDEV_I(inode)->bdev;
 }
-
 EXPORT_SYMBOL(I_BDEV);
 
 /*
- * move the inode from it's current bdi to the a new bdi. if the inode is dirty
- * we need to move it onto the dirty list of @dst so that the inode is always
- * on the right list.
+ * Move the inode from its current bdi to a new bdi. If the inode is dirty we
+ * need to move it onto the dirty list of @dst so that the inode is always on
+ * the right list.
  */
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
+	if (unlikely(dst == old))		/* deadlock avoidance */
+		return;
+	bdi_lock_two(&old->wb, &dst->wb);
 	spin_lock(&inode->i_lock);
 	inode->i_data.backing_dev_info = dst;
 	if (inode->i_state & I_DIRTY)
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&old->wb.list_lock);
+	spin_unlock(&dst->wb.list_lock);
 }
 
 static sector_t max_block(struct block_device *bdev)
--- linux-next.orig/fs/fs-writeback.c	2011-06-08 05:14:14.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-08 05:14:32.000000000 +0800
@@ -181,12 +181,13 @@ void bdi_start_background_writeback(stru
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
 	list_del_init(&inode->i_wb_list);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 }
 
-
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -196,11 +197,9 @@ void inode_wb_list_del(struct inode *ino
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -214,11 +213,9 @@ static void redirty_tail(struct inode *i
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
-static void requeue_io(struct inode *inode)
+static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
@@ -226,7 +223,7 @@ static void inode_sync_complete(struct i
 {
 	/*
 	 * Prevent speculative execution through
-	 * spin_unlock(&inode_wb_list_lock);
+	 * spin_unlock(&wb->list_lock);
 	 */
 
 	smp_mb();
@@ -302,7 +299,7 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -317,7 +314,8 @@ static int write_inode(struct inode *ino
 /*
  * Wait for writeback on an inode to complete.
  */
-static void inode_wait_for_writeback(struct inode *inode)
+static void inode_wait_for_writeback(struct inode *inode,
+				     struct bdi_writeback *wb)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -325,15 +323,15 @@ static void inode_wait_for_writeback(str
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_wb_list_lock and
+ * Write out an inode's dirty pages.  Called under wb->list_lock and
  * inode->i_lock.  Either the caller has an active reference on the inode or
  * the inode has I_WILL_FREE set.
  *
@@ -344,13 +342,14 @@ static void inode_wait_for_writeback(str
  * livelocks, etc.
  */
 static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+		       struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
 
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
 
 	if (!atomic_read(&inode->i_count))
@@ -368,14 +367,14 @@ writeback_single_inode(struct inode *ino
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			return 0;
 		}
 
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
-		inode_wait_for_writeback(inode);
+		inode_wait_for_writeback(inode, wb);
 	}
 
 	BUG_ON(inode->i_state & I_SYNC);
@@ -384,7 +383,7 @@ writeback_single_inode(struct inode *ino
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -415,7 +414,7 @@ writeback_single_inode(struct inode *ino
 			ret = err;
 	}
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -438,7 +437,7 @@ writeback_single_inode(struct inode *ino
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(inode);
+				requeue_io(inode, wb);
 			} else {
 				/*
 				 * Writeback blocked by something other than
@@ -447,7 +446,7 @@ writeback_single_inode(struct inode *ino
 				 * retrying writeback of the dirty page/inode
 				 * that cannot be performed immediately.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 			}
 		} else if (inode->i_state & I_DIRTY) {
 			/*
@@ -456,7 +455,7 @@ writeback_single_inode(struct inode *ino
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -521,7 +520,7 @@ static int writeback_sb_inodes(struct su
 				 * superblock, move all inodes not belonging
 				 * to it back onto the dirty list.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 				continue;
 			}
 
@@ -541,7 +540,7 @@ static int writeback_sb_inodes(struct su
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 
@@ -557,19 +556,19 @@ static int writeback_sb_inodes(struct su
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wbc);
+		writeback_single_inode(inode, wb, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -588,7 +587,7 @@ void writeback_inodes_wb(struct bdi_writ
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
@@ -598,7 +597,7 @@ void writeback_inodes_wb(struct bdi_writ
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 		ret = writeback_sb_inodes(sb, wb, wbc, false);
@@ -607,7 +606,7 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -616,11 +615,11 @@ static void __writeback_inodes_sb(struct
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 }
 
 /*
@@ -762,15 +761,15 @@ static long wb_writeback(struct bdi_writ
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode);
+			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 	}
 
 	return wrote;
@@ -1104,10 +1103,10 @@ void __mark_inode_dirty(struct inode *in
 			}
 
 			spin_unlock(&inode->i_lock);
-			spin_lock(&inode_wb_list_lock);
+			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
-			spin_unlock(&inode_wb_list_lock);
+			spin_unlock(&bdi->wb.list_lock);
 
 			if (wakeup_bdi)
 				bdi_wakeup_thread_delayed(bdi);
@@ -1309,6 +1308,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
@@ -1321,11 +1321,11 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, &wbc);
+	ret = writeback_single_inode(inode, wb, &wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -1345,13 +1345,14 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wbc);
+	ret = writeback_single_inode(inode, wb, wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
--- linux-next.orig/fs/inode.c	2011-06-08 05:13:45.000000000 +0800
+++ linux-next/fs/inode.c	2011-06-08 05:14:32.000000000 +0800
@@ -37,7 +37,7 @@
  *   inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
- * inode_wb_list_lock protects:
+ * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
@@ -48,7 +48,7 @@
  *   inode->i_lock
  *     inode_lru_lock
  *
- * inode_wb_list_lock
+ * bdi->wb.list_lock
  *   inode->i_lock
  *
  * inode_hash_lock
@@ -68,7 +68,6 @@ static LIST_HEAD(inode_lru);
 static DEFINE_SPINLOCK(inode_lru_lock);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);
 
 /*
  * iprune_sem provides exclusion between the icache shrinking and the
--- linux-next.orig/include/linux/backing-dev.h	2011-06-08 05:11:57.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-08 05:14:32.000000000 +0800
@@ -57,6 +57,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
 struct backing_dev_info {
@@ -106,6 +107,7 @@ int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux-next.orig/include/linux/writeback.h	2011-06-08 05:14:14.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-08 05:14:32.000000000 +0800
@@ -9,8 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_wb_list_lock;
-
 /*
  * fs/fs-writeback.c
  */
--- linux-next.orig/mm/backing-dev.c	2011-06-08 05:13:47.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-08 05:14:48.000000000 +0800
@@ -45,6 +45,17 @@ static struct timer_list sync_supers_tim
 static int bdi_sync_supers(void *);
 static void sync_supers_timer_fn(unsigned long);
 
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
+{
+	if (wb1 < wb2) {
+		spin_lock(&wb1->list_lock);
+		spin_lock_nested(&wb2->list_lock, 1);
+	} else {
+		spin_lock(&wb2->list_lock);
+		spin_lock_nested(&wb1->list_lock, 1);
+	}
+}
+
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
@@ -67,14 +78,14 @@ static int bdi_debug_stats_show(struct s
 	struct inode *inode;
 
 	nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -628,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -676,11 +688,12 @@ void bdi_destroy(struct backing_dev_info
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&inode_wb_list_lock);
+		bdi_lock_two(&bdi->wb, dst);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
+		spin_unlock(&dst->list_lock);
 	}
 
 	bdi_unregister(bdi);
--- linux-next.orig/mm/filemap.c	2011-06-08 05:13:47.000000000 +0800
+++ linux-next/mm/filemap.c	2011-06-08 05:14:32.000000000 +0800
@@ -81,7 +81,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  inode_wb_list_lock
+ *  bdi->wb.list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -99,9 +99,9 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(page_remove_rmap->set_page_dirty)
+ *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(zap_pte_range->set_page_dirty)
+ *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
--- linux-next.orig/mm/rmap.c	2011-06-08 05:13:47.000000000 +0800
+++ linux-next/mm/rmap.c	2011-06-08 05:14:32.000000000 +0800
@@ -32,11 +32,11 @@
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
  *               inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *               inode_wb_list_lock (in set_page_dirty's __mark_inode_dirty)
+ *               bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                 sb_lock (within inode_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_wb_list_lock in __sync_single_inode)
+ *                           within bdi.wb->list_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock




* [PATCH 08/15] writeback: elevate queue_io() into wb_writeback()
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: move-queue_io.patch --]
[-- Type: text/plain, Size: 3489 bytes --]

Code refactor for more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
  before calling __writeback_inodes_wb() (the resulting layout is
  condensed below)
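
Condensed from the diff below, callers now own wb->list_lock and the
b_io refill, while __writeback_inodes_wb() only walks the list:

	void writeback_inodes_wb(struct bdi_writeback *wb,
			struct writeback_control *wbc)
	{
		spin_lock(&wb->list_lock);
		if (list_empty(&wb->b_io))
			queue_io(wb, wbc->older_than_this);
		__writeback_inodes_wb(wb, wbc);
		spin_unlock(&wb->list_lock);
	}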

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-06-08 05:14:32.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-08 05:16:08.000000000 +0800
@@ -580,17 +580,13 @@ static int writeback_sb_inodes(struct su
 	return 1;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+static void __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct writeback_control *wbc)
 {
 	int ret = 0;
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&wb->list_lock);
-
-	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -606,19 +602,16 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
-static void __writeback_inodes_sb(struct super_block *sb,
-		struct bdi_writeback *wb, struct writeback_control *wbc)
+void writeback_inodes_wb(struct bdi_writeback *wb,
+		struct writeback_control *wbc)
 {
-	WARN_ON(!rwsem_is_locked(&sb->s_umount));
-
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
-	writeback_sb_inodes(sb, wb, wbc, true);
+	__writeback_inodes_wb(wb, wbc);
 	spin_unlock(&wb->list_lock);
 }
 
@@ -685,7 +678,7 @@ static long wb_writeback(struct bdi_writ
 	 * The intended call sequence for WB_SYNC_ALL writeback is:
 	 *
 	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
+	 *          writeback_sb_inodes()       <== called only once
 	 *              write_cache_pages()     <== called once for each inode
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
@@ -694,6 +687,7 @@ static long wb_writeback(struct bdi_writ
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
+	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -730,10 +724,12 @@ static long wb_writeback(struct bdi_writ
 		wbc.inodes_cleaned = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
+		if (list_empty(&wb->b_io))
+			queue_io(wb, wbc.older_than_this);
 		if (work->sb)
-			__writeback_inodes_sb(work->sb, wb, &wbc);
+			writeback_sb_inodes(work->sb, wb, &wbc, true);
 		else
-			writeback_inodes_wb(wb, &wbc);
+			__writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
@@ -761,7 +757,6 @@ static long wb_writeback(struct bdi_writ
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
@@ -769,8 +764,8 @@ static long wb_writeback(struct bdi_writ
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&wb->list_lock);
 	}
+	spin_unlock(&wb->list_lock);
 
 	return wrote;
 }



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 09/15] writeback: avoid extra sync work at enqueue time
  2011-06-07 21:32 [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Wu Fengguang
                   ` (7 preceding siblings ...)
  2011-06-07 21:32 ` [PATCH 08/15] writeback: elevate queue_io() into wb_writeback() Wu Fengguang
@ 2011-06-07 21:32 ` Wu Fengguang
  2011-06-07 21:32 ` [PATCH 10/15] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-wb_start.patch --]
[-- Type: text/plain, Size: 2179 bytes --]

This removes writeback_control.wb_start and implements more
straightforward sync livelock prevention: set .older_than_this to
prevent extra inodes from being enqueued in the first place.
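
The mechanism relied upon is the expiry gate in move_expired_inodes();
an abridged sketch of the 3.0-era code:

	while (!list_empty(delaying_queue)) {
		inode = wb_inode(delaying_queue->prev);
		if (older_than_this &&
		    inode_dirtied_after(inode, *older_than_this))
			break;	/* later-dirtied inodes stay on b_dirty */
		list_move(&inode->i_wb_list, &tmp);
	}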

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   16 +++-------------
 include/linux/writeback.h |    3 ---
 2 files changed, 3 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:21.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:22.000000000 +0800
@@ -544,15 +544,6 @@ static int writeback_sb_inodes(struct su
 			continue;
 		}
 
-		/*
-		 * Was this inode dirtied after sync_sb_inodes was called?
-		 * This keeps sync from extra jobs and livelock.
-		 */
-		if (inode_dirtied_after(inode, wbc->wb_start)) {
-			spin_unlock(&inode->i_lock);
-			return 1;
-		}
-
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
@@ -585,9 +576,6 @@ static void __writeback_inodes_wb(struct
 {
 	int ret = 0;
 
-	if (!wbc->wb_start)
-		wbc->wb_start = jiffies; /* livelock avoidance */
-
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
 		struct super_block *sb = inode->i_sb;
@@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
 	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
 		write_chunk = LONG_MAX;
 
-	wbc.wb_start = jiffies; /* livelock avoidance */
+	oldest_jif = jiffies;
+	wbc.older_than_this = &oldest_jif;
+
 	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:20.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:22.000000000 +0800
@@ -26,9 +26,6 @@ struct writeback_control {
 	enum writeback_sync_modes sync_mode;
 	unsigned long *older_than_this;	/* If !NULL, only write back inodes
 					   older than this */
-	unsigned long wb_start;         /* Time writeback_inodes_wb was
-					   called. This is needed to avoid
-					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 10/15] writeback: add bdi_dirty_limit() kernel-doc
  2011-06-07 21:32 [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Wu Fengguang
                   ` (8 preceding siblings ...)
  2011-06-07 21:32 ` [PATCH 09/15] writeback: avoid extra sync work at enqueue time Wu Fengguang
@ 2011-06-07 21:32 ` Wu Fengguang
  2011-06-07 21:32 ` [PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-task_dirty_limit-comment.patch --]
[-- Type: text/plain, Size: 1193 bytes --]

Clarify the bdi_dirty_limit() comment.
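
For context, the computation being documented works roughly like this
(abridged from the 3.0-era mm/page-writeback.c, so treat it as a
sketch): each bdi receives a share of the global limit proportional to
its recent writeout fraction, clamped by the per-bdi min/max ratios.

	bdi_writeout_fraction(bdi, &numerator, &denominator);

	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
	bdi_dirty *= numerator;
	do_div(bdi_dirty, denominator);

	bdi_dirty += (dirty * bdi->min_ratio) / 100;
	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
		bdi_dirty = dirty * bdi->max_ratio / 100;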

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-24 11:17:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-24 11:17:23.000000000 +0800
@@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-/*
+/**
  * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ *
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * And the "limit" in the name is not seriously taken as hard limit in
+ * balance_dirty_pages().
  *
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
  * - piling up dirty pages (that will take long time to sync) on slow devices
  *



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs
  2011-06-07 21:32 [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Wu Fengguang
                   ` (9 preceding siblings ...)
  2011-06-07 21:32 ` [PATCH 10/15] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
@ 2011-06-07 21:32 ` Wu Fengguang
  2011-06-11 13:07   ` Wu Fengguang
  2011-06-07 21:32 ` [PATCH 12/15] writeback: remove writeback_control.more_io Wu Fengguang
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Wu Fengguang, Dave Chinner, Christoph Hellwig, linux-fsdevel,
	LKML

[-- Attachment #1: writeback-trace-global-dirty-states-fix.patch --]
[-- Type: text/plain, Size: 2707 bytes --]

This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

Notes about the tmpfs/ramfs behavior changes:

In 2.6.36 and older kernels, tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold.  This is because both the dirty page count
and the threshold will be 0 for tmpfs/ramfs, hence this test will
always evaluate to TRUE:

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                        || (nr_reclaimable + nr_writeback >= dirty_thresh);

For 2.6.37, someone complained that the then-current logic did not allow
users to set vm.dirty_ratio=0.  So commit 4cbec4c8b9 changed the test to

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                        || (nr_reclaimable + nr_writeback > dirty_thresh);

So 2.6.37 will behave differently for tmpfs/ramfs: they will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen, but once it does, it will block many tasks).

I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
that on a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly help any workload because of its
"either no throttling, or throttled to death" property.

So relative to 2.6.37, this patch won't bring more noticeable changes.
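
For reference, the new early return keys off the BDI capability flags;
a minimal sketch, assuming the 3.0-era flag names (example_bdi is a
made-up name):

	/* an in-memory fs opts out of dirty accounting via its bdi;
	 * bdi_cap_account_dirty() then returns false for it */
	static struct backing_dev_info example_bdi = {
		.name		= "example",
		.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
	};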

CC: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-24 11:17:23.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-24 11:17:24.000000000 +0800
@@ -244,13 +244,8 @@ void task_dirty_inc(struct task_struct *
 static void bdi_writeout_fraction(struct backing_dev_info *bdi,
 		long *numerator, long *denominator)
 {
-	if (bdi_cap_writeback_dirty(bdi)) {
-		prop_fraction_percpu(&vm_completions, &bdi->completions,
+	prop_fraction_percpu(&vm_completions, &bdi->completions,
 				numerator, denominator);
-	} else {
-		*numerator = 0;
-		*denominator = 1;
-	}
 }
 
 static inline void task_dirties_fraction(struct task_struct *tsk,
@@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (!bdi_cap_account_dirty(bdi))
+		return;
+
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-06-07 21:32 [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Wu Fengguang
                   ` (10 preceding siblings ...)
  2011-06-07 21:32 ` [PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
@ 2011-06-07 21:32 ` Wu Fengguang
  2011-07-11 21:31   ` Hugh Dickins
  2011-06-07 21:32 ` [PATCH 13/15] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-more_io.patch --]
[-- Type: text/plain, Size: 4465 bytes --]

When wbc.more_io was first introduced, it indicated whether there was
at least one superblock whose s_more_io list contained more IO work.
Now with per-bdi writeback, it can be replaced with a simple b_more_io
test.
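
The equivalence holds because requeue_io() is what populates b_more_io;
roughly, as of this series:

	static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		list_move(&inode->i_wb_list, &wb->b_more_io);
	}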

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    9 ++-------
 include/linux/writeback.h        |    1 -
 include/trace/events/ext4.h      |    6 ++----
 include/trace/events/writeback.h |    5 +----
 4 files changed, 5 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-06-08 05:16:56.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-08 05:17:13.000000000 +0800
@@ -560,12 +560,8 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			return 1;
-		}
-		if (!list_empty(&wb->b_more_io))
-			wbc->more_io = 1;
 	}
 	/* b_io is empty */
 	return 1;
@@ -708,7 +704,6 @@ static long wb_writeback(struct bdi_writ
 			wbc.older_than_this = &oldest_jif;
 		}
 
-		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 		wbc.inodes_cleaned = 0;
@@ -740,7 +735,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * No more inodes for IO, bail
 		 */
-		if (!wbc.more_io)
+		if (list_empty(&wb->b_more_io))
 			break;
 		/*
 		 * Nothing written. Wait for some inode to
--- linux-next.orig/include/linux/writeback.h	2011-06-08 05:16:56.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-08 05:17:13.000000000 +0800
@@ -46,7 +46,6 @@ struct writeback_control {
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
-	unsigned more_io:1;		/* more io to be dispatched */
 };
 
 /*
--- linux-next.orig/include/trace/events/writeback.h	2011-06-08 05:11:56.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-06-08 05:17:13.000000000 +0800
@@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(int, more_io)
 		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
@@ -116,7 +115,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->more_io	= wbc->more_io;
 		__entry->older_than_this = wbc->older_than_this ?
 						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
@@ -124,7 +122,7 @@ DECLARE_EVENT_CLASS(wbc_class,
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -134,7 +132,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->more_io,
 		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
--- linux-next.orig/include/trace/events/ext4.h	2011-06-08 05:11:56.000000000 +0800
+++ linux-next/include/trace/events/ext4.h	2011-06-08 05:17:08.000000000 +0800
@@ -404,7 +404,6 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__field(	int,	pages_written		)
 		__field(	long,	pages_skipped		)
 		__field(	int,	sync_mode		)
-		__field(	char,	more_io			)	
 		__field(       pgoff_t,	writeback_index		)
 	),
 
@@ -415,16 +414,15 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__entry->pages_written	= pages_written;
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->sync_mode	= wbc->sync_mode;
-		__entry->more_io	= wbc->more_io;
 		__entry->writeback_index = inode->i_mapping->writeback_index;
 	),
 
 	TP_printk("dev %d,%d ino %lu ret %d pages_written %d pages_skipped %ld "
-		  " more_io %d sync_mode %d writeback_index %lu",
+		  "sync_mode %d writeback_index %lu",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino, __entry->ret,
 		  __entry->pages_written, __entry->pages_skipped,
-		  __entry->more_io, __entry->sync_mode,
+		  __entry->sync_mode,
 		  (unsigned long) __entry->writeback_index)
 );
 



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 13/15] writeback: remove .nonblocking and .encountered_congestion
  2011-06-07 21:32 [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Wu Fengguang
                   ` (11 preceding siblings ...)
  2011-06-07 21:32 ` [PATCH 12/15] writeback: remove writeback_control.more_io Wu Fengguang
@ 2011-06-07 21:32 ` Wu Fengguang
  2011-06-07 21:32 ` [PATCH 14/15] writeback: trace event writeback_single_inode Wu Fengguang
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-nonblocking.patch --]
[-- Type: text/plain, Size: 3584 bytes --]

Remove two unused struct writeback_control fields:

	.encountered_congestion	(completely unused)
	.nonblocking		(never set; checked/shown in XFS, NFS and btrfs)

The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.
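
The implication holds because background writeback is always queued
with WB_SYNC_NONE; a sketch of the work item, roughly as set up by
wb_check_background_flush() in this era:

	struct wb_writeback_work work = {
		.nr_pages	= LONG_MAX,
		.sync_mode	= WB_SYNC_NONE,
		.for_background	= 1,
		.range_cyclic	= 1,
	};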

Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c               |    3 +--
 fs/xfs/linux-2.6/xfs_aops.c  |    2 +-
 include/linux/writeback.h    |    2 --
 include/trace/events/btrfs.h |    6 ++----
 4 files changed, 4 insertions(+), 9 deletions(-)

--- linux-next.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-06-08 05:17:13.000000000 +0800
+++ linux-next/fs/xfs/linux-2.6/xfs_aops.c	2011-06-08 05:17:14.000000000 +0800
@@ -970,7 +970,7 @@ xfs_vm_writepage(
 	offset = page_offset(page);
 	type = IO_OVERWRITE;
 
-	if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
+	if (wbc->sync_mode == WB_SYNC_NONE)
 		nonblocking = 1;
 
 	do {
--- linux-next.orig/include/trace/events/btrfs.h	2011-06-08 05:17:13.000000000 +0800
+++ linux-next/include/trace/events/btrfs.h	2011-06-08 05:17:14.000000000 +0800
@@ -284,7 +284,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__field(	long,   pages_skipped		)
 		__field(	loff_t, range_start		)
 		__field(	loff_t, range_end		)
-		__field(	char,   nonblocking		)
 		__field(	char,   for_kupdate		)
 		__field(	char,   for_reclaim		)
 		__field(	char,   range_cyclic		)
@@ -299,7 +298,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->range_start	= wbc->range_start;
 		__entry->range_end	= wbc->range_end;
-		__entry->nonblocking	= wbc->nonblocking;
 		__entry->for_kupdate	= wbc->for_kupdate;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
@@ -310,13 +308,13 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 
 	TP_printk("root = %llu(%s), ino = %lu, page_index = %lu, "
 		  "nr_to_write = %ld, pages_skipped = %ld, range_start = %llu, "
-		  "range_end = %llu, nonblocking = %d, for_kupdate = %d, "
+		  "range_end = %llu, for_kupdate = %d, "
 		  "for_reclaim = %d, range_cyclic = %d, writeback_index = %lu",
 		  show_root_type(__entry->root_objectid),
 		  (unsigned long)__entry->ino, __entry->index,
 		  __entry->nr_to_write, __entry->pages_skipped,
 		  __entry->range_start, __entry->range_end,
-		  __entry->nonblocking, __entry->for_kupdate,
+		  __entry->for_kupdate,
 		  __entry->for_reclaim, __entry->range_cyclic,
 		  (unsigned long)__entry->writeback_index)
 );
--- linux-next.orig/fs/nfs/write.c	2011-06-08 05:17:13.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-06-08 05:17:14.000000000 +0800
@@ -1564,8 +1564,7 @@ int nfs_write_inode(struct inode *inode,
 		int status;
 		bool sync = true;
 
-		if (wbc->sync_mode == WB_SYNC_NONE || wbc->nonblocking ||
-		    wbc->for_background)
+		if (wbc->sync_mode == WB_SYNC_NONE)
 			sync = false;
 
 		status = pnfs_layoutcommit_inode(inode, sync);
--- linux-next.orig/include/linux/writeback.h	2011-06-08 05:17:13.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-08 05:17:14.000000000 +0800
@@ -39,8 +39,6 @@ struct writeback_control {
 	loff_t range_start;
 	loff_t range_end;
 
-	unsigned nonblocking:1;		/* Don't get stuck on request queues */
-	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 14/15] writeback: trace event writeback_single_inode
  2011-06-07 21:32 [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Wu Fengguang
                   ` (12 preceding siblings ...)
  2011-06-07 21:32 ` [PATCH 13/15] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
@ 2011-06-07 21:32 ` Wu Fengguang
  2011-06-07 21:32 ` [PATCH 15/15] writeback: trace event writeback_queue_io Wu Fengguang
  2011-06-07 23:04 ` [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Andrew Morton
  15 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/plain, Size: 4034 bytes --]

It is valuable to know how the dirty inodes are iterated over and what
their IO sizes are.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

v2: add trace event writeback_single_inode_requeue as proposed by Dave.
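
For clarity, a sketch of how the example fields are derived, mirroring
the TP_fast_assign() below ("age" is in milliseconds, "wrote" in pages):

	age   = (jiffies - inode->dirtied_when) * 1000 / HZ;
	wrote = nr_to_write - wbc->nr_to_write;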

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    4 +
 include/trace/events/writeback.h |   70 +++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+)

--- linux-next.orig/include/trace/events/writeback.h	2011-05-24 11:17:25.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-24 11:17:26.000000000 +0800
@@ -8,6 +8,19 @@
 #include <linux/device.h>
 #include <linux/writeback.h>
 
+#define show_inode_state(state)					\
+	__print_flags(state, "|",				\
+		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
+		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
+		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"},	\
+		{I_NEW,			"I_NEW"},		\
+		{I_WILL_FREE,		"I_WILL_FREE"},		\
+		{I_FREEING,		"I_FREEING"},		\
+		{I_CLEAR,		"I_CLEAR"},		\
+		{I_SYNC,		"I_SYNC"},		\
+		{I_REFERENCED,		"I_REFERENCED"}		\
+	)
+
 struct wb_writeback_work;
 
 DECLARE_EVENT_CLASS(writeback_work_class,
@@ -184,6 +197,63 @@ DEFINE_EVENT(writeback_congest_waited_te
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
+DECLARE_EVENT_CLASS(writeback_single_inode_template,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write
+	),
+
+	TP_ARGS(inode, wbc, nr_to_write),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, age)
+		__field(unsigned long, writeback_index)
+		__field(long, nr_to_write)
+		__field(unsigned long, wrote)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->age		= (jiffies - inode->dirtied_when) *
+								1000 / HZ;
+		__entry->writeback_index = inode->i_mapping->writeback_index;
+		__entry->nr_to_write	= nr_to_write;
+		__entry->wrote		= nr_to_write - wbc->nr_to_write;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+		  "index=%lu to_write=%ld wrote=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->age,
+		  __entry->writeback_index,
+		  __entry->nr_to_write,
+		  __entry->wrote
+	)
+);
+
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_requeue,
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write),
+	TP_ARGS(inode, wbc, nr_to_write)
+);
+
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write),
+	TP_ARGS(inode, wbc, nr_to_write)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:25.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:26.000000000 +0800
@@ -346,6 +346,7 @@ writeback_single_inode(struct inode *ino
 		       struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -368,6 +369,8 @@ writeback_single_inode(struct inode *ino
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
 			requeue_io(inode, wb);
+			trace_writeback_single_inode_requeue(inode, wbc,
+							     nr_to_write);
 			return 0;
 		}
 
@@ -467,6 +470,7 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+	trace_writeback_single_inode(inode, wbc, nr_to_write);
 	return ret;
 }
 



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 15/15] writeback: trace event writeback_queue_io
  2011-06-07 21:32 [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Wu Fengguang
                   ` (13 preceding siblings ...)
  2011-06-07 21:32 ` [PATCH 14/15] writeback: trace event writeback_single_inode Wu Fengguang
@ 2011-06-07 21:32 ` Wu Fengguang
  2011-06-07 23:04 ` [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Andrew Morton
  15 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-queue_io.patch --]
[-- Type: text/plain, Size: 3503 bytes --]

Note that this adds a little overhead to account for the inodes
moved/enqueued from b_dirty to b_io. The "moved" count may later be used
to limit the number of inodes that can be moved in one shot, in order to
keep the spinlock hold time under control.
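
A hypothetical follow-up, not part of this patch (MAX_MOVE_BATCH is a
made-up name), could cap the batch inside the move loop to bound
wb->list_lock hold time:

	list_move(&inode->i_wb_list, &tmp);
	if (++moved >= MAX_MOVE_BATCH)
		break;	/* leave the rest for the next queue_io() */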

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   14 ++++++++++----
 include/trace/events/writeback.h |   25 +++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 4 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:26.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:27.000000000 +0800
@@ -248,15 +248,16 @@ static bool inode_dirtied_after(struct i
 /*
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
-static void move_expired_inodes(struct list_head *delaying_queue,
+static int move_expired_inodes(struct list_head *delaying_queue,
 			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+			       unsigned long *older_than_this)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
+	int moved = 0;
 
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
@@ -267,12 +268,13 @@ static void move_expired_inodes(struct l
 			do_sb_sort = 1;
 		sb = inode->i_sb;
 		list_move(&inode->i_wb_list, &tmp);
+		moved++;
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
 	if (!do_sb_sort) {
 		list_splice(&tmp, dispatch_queue);
-		return;
+		goto out;
 	}
 
 	/* Move inodes from one superblock together */
@@ -284,6 +286,8 @@ static void move_expired_inodes(struct l
 				list_move(&inode->i_wb_list, dispatch_queue);
 		}
 	}
+out:
+	return moved;
 }
 
 /*
@@ -299,9 +303,11 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
+	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	trace_writeback_queue_io(wb, older_than_this, moved);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
--- linux-next.orig/include/trace/events/writeback.h	2011-05-24 11:17:26.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-24 11:17:27.000000000 +0800
@@ -162,6 +162,31 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt
 DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_queue_io,
+	TP_PROTO(struct bdi_writeback *wb,
+		 unsigned long *older_than_this,
+		 int moved),
+	TP_ARGS(wb, older_than_this, moved),
+	TP_STRUCT__entry(
+		__array(char,		name, 32)
+		__field(unsigned long,	older)
+		__field(long,		age)
+		__field(int,		moved)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+		__entry->older	= older_than_this ?  *older_than_this : 0;
+		__entry->age	= older_than_this ?
+				  (jiffies - *older_than_this) * 1000 / HZ : -1;
+		__entry->moved	= moved;
+	),
+	TP_printk("bdi %s: older=%lu age=%ld enqueue=%d",
+		__entry->name,
+		__entry->older,	/* older_than_this in jiffies */
+		__entry->age,	/* older_than_this in relative milliseconds */
+		__entry->moved)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/15] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
  2011-06-07 21:32 ` [PATCH 01/15] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage Wu Fengguang
@ 2011-06-07 23:02   ` Andrew Morton
  2011-06-07 23:24     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2011-06-07 23:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

On Wed, 08 Jun 2011 05:32:37 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
> do livelock prevention for it, too.
> 
> Note that writeback_inodes_sb() is called by not only sync(), they are
> treated the same because the other callers also need livelock prevention.
> 
> Impact:  It changes the order in which pages/inodes are synced to disk.
> Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
> until finished with the current inode.

What problem is this patch actually fixing?  It sounds like there's
some livelock scenario in the WB_SYNC_NONE phase.  otoh the final
paragraph implies that the WB_SYNC_NONE phase is failing to write some
pages under some situations.

Suggest that the changelog be fleshed out to cover all of this.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/15] writeback: update dirtied_when for synced inode to prevent livelock
  2011-06-07 21:32 ` [PATCH 02/15] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
@ 2011-06-07 23:02   ` Andrew Morton
  2011-06-07 23:51     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2011-06-07 23:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

On Wed, 08 Jun 2011 05:32:38 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Explicitly update .dirtied_when on synced inodes, so that they are no
> longer considered for writeback in the next round.

It sounds like this somewhat answers my questions for [1/15].

But I'm not seeing a description of exactly what caused the livelock.

> We'll do more aggressive "keep writeback as long as we wrote something"
> logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
> b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
> no longer be enough to stop sync livelock.
> 
> It can prevent both of the following livelock schemes:
> 
> - while true; do echo data >> f; done
> - while true; do touch f;        done

You're kidding.  This livelocks sync(1)?  When did we break this?

Why is this?  Because the inode keeps on getting rotated to head-of-list?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/15] writeback: introduce writeback_control.inodes_cleaned
  2011-06-07 21:32 ` [PATCH 03/15] writeback: introduce writeback_control.inodes_cleaned Wu Fengguang
@ 2011-06-07 23:03   ` Andrew Morton
  2011-06-08  0:10     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2011-06-07 23:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Mel Gorman, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

On Wed, 08 Jun 2011 05:32:39 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> The flusher works on dirty inodes in batches, and may quit prematurely
> if the batch of inodes happen to be metadata-only dirtied: in this case
> wbc->nr_to_write won't be decreased at all, which stands for "no pages
> written" but also mis-interpreted as "no progress".
> 
> So introduce writeback_control.inodes_cleaned to count the inodes get
> cleaned.  A non-zero value means there are some progress on writeback,
> in which case more writeback can be tried.

Yes, that makes sense.  I had a workload which demonstrated/exploited
this nine years ago but I never got around to fixing it, never told
anyone and nobody noticed ;)

> +	long inodes_cleaned;		/* # of inodes cleaned */

nanonit: I'd call this inodes_written, because they may not actually be
clean.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
  2011-06-07 21:32 ` [PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock Wu Fengguang
@ 2011-06-07 23:03   ` Andrew Morton
  2011-06-08  0:20     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2011-06-07 23:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed, 08 Jun 2011 05:32:43 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

>  static void bdev_inode_switch_bdi(struct inode *inode,
>  			struct backing_dev_info *dst)
>  {
> -	spin_lock(&inode_wb_list_lock);
> +	struct backing_dev_info *old = inode->i_data.backing_dev_info;
> +
> +	if (unlikely(dst == old))		/* deadlock avoidance */
> +		return;

Why does this occur?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5)
  2011-06-07 21:32 [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Wu Fengguang
                   ` (14 preceding siblings ...)
  2011-06-07 21:32 ` [PATCH 15/15] writeback: trace event writeback_queue_io Wu Fengguang
@ 2011-06-07 23:04 ` Andrew Morton
  2011-06-08  2:01   ` Wu Fengguang
  15 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2011-06-07 23:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

On Wed, 08 Jun 2011 05:32:36 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> Andrew,
> 
> Here are the intented patches for upstream submission, rebased to v3.0-rc2:
> 
> 	[PATCH 01/15] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
> 	[PATCH 02/15] writeback: update dirtied_when for synced inode to prevent livelock
> 	[PATCH 03/15] writeback: introduce writeback_control.inodes_cleaned
> 	[PATCH 04/15] writeback: try more writeback as long as something was written
> 	[PATCH 05/15] writeback: the kupdate expire timestamp should be a moving target
> 	[PATCH 06/15] writeback: refill b_io iff empty
> 	[PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
> 	[PATCH 08/15] writeback: elevate queue_io() into wb_writeback()
> 	[PATCH 09/15] writeback: avoid extra sync work at enqueue time
> 	[PATCH 10/15] writeback: add bdi_dirty_limit() kernel-doc
> 	[PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs
> 	[PATCH 12/15] writeback: remove writeback_control.more_io
> 	[PATCH 13/15] writeback: remove .nonblocking and .encountered_congestion
> 	[PATCH 14/15] writeback: trace event writeback_single_inode
> 	[PATCH 15/15] writeback: trace event writeback_queue_io

Looks OK to me, although I think a few of the changelogs are lacking
important details.

> They are git pullable from
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback
> 

Please ask Stephen to suck this into linux-next when ready.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/15] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
  2011-06-07 23:02   ` Andrew Morton
@ 2011-06-07 23:24     ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 23:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

On Wed, Jun 08, 2011 at 07:02:34AM +0800, Andrew Morton wrote:
> On Wed, 08 Jun 2011 05:32:37 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> > WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
> > do livelock prevention for it, too.
> > 
> > Note that writeback_inodes_sb() is called by not only sync(), they are
> > treated the same because the other callers also need livelock prevention.
> > 
> > Impact:  It changes the order in which pages/inodes are synced to disk.
> > Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
> > until finished with the current inode.
> 
> What problem is this patch actually fixing?  It sounds like there's
> some livelock scenario in the WB_SYNC_NONE phase.  otoh the final
> paragraph implies that the WB_SYNC_NONE phase is failing to write some
> pages under some situations.

Problem is: the WB_SYNC_NONE phase has no livelock prevention _at all_.

Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.

Although ext4 has been tested to no longer livelock with commit
f446daaea9, that may be due to some "redirty_tail() after pages_skipped"
effect, which is by no means a guarantee for _all_ file systems.
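
Concretely, the livelock prevention is write_cache_pages()'s
tag-and-sweep scheme, which .tagged_writepages extends to the
WB_SYNC_NONE stage; abridged from mm/page-writeback.c with this patch
applied:

	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
	...
	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
		tag_pages_for_writeback(mapping, index, end);

Pages that are dirty at the start are tagged once, then only tagged
pages are written, so pages dirtied during the sweep are never chased.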

> Suggest that the changelog be fleshed out to cover all of this.

OK, I'll add the above two paragraphs to the changelog.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/15] writeback: update dirtied_when for synced inode to prevent livelock
  2011-06-07 23:02   ` Andrew Morton
@ 2011-06-07 23:51     ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-07 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

On Wed, Jun 08, 2011 at 07:02:45AM +0800, Andrew Morton wrote:
> On Wed, 08 Jun 2011 05:32:38 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Explicitly update .dirtied_when on synced inodes, so that they are no
> > longer considered for writeback in the next round.
> 
> It sounds like this somewhat answers my questions for [1/15].
> 
> But I'm not seeing a description of exactly what caused the livelock.

The exact livelock condition is, during sync(1):

(1) no new inodes are dirtied
(2) an inode is being actively dirtied

On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
When finished, it will be redirty_tail()ed because it's still dirty
and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
on condition (1). The sync work will then revisit it on the next
queue_io() and find it eligible again because its old ->dirtied_when
predates the sync work start time.
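
For reference, redirty_tail() as of this series looks roughly like the
sketch below; note that ->dirtied_when is only refreshed when the inode
would otherwise break b_dirty's time ordering, i.e. only when something
was dirtied more recently:

	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		if (!list_empty(&wb->b_dirty)) {
			struct inode *tail = wb_inode(wb->b_dirty.next);

			if (time_before(inode->dirtied_when, tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}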

I'll add the above to the changelog.

> > We'll do more aggressive "keep writeback as long as we wrote something"
> > logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
> > b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
> > no longer be enough to stop sync livelock.
> > 
> > It can prevent both of the following livelock schemes:
> > 
> > - while true; do echo data >> f; done
> > - while true; do touch f;        done
> 
> You're kidding.  This livelocks sync(1)?  When did we break this?

There are no reported real-world cases of the "touch f" style livelock.
It's merely a theoretical possibility: the more concurrent metadata
dirtying there is, the more likely it is to happen.

> Why is this?  Because the inode keeps on getting rotated to head-of-list?

Yes, when the inode is always redirty_tail()ed without updating its
->dirtied_when.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/15] writeback: introduce writeback_control.inodes_cleaned
  2011-06-07 23:03   ` Andrew Morton
@ 2011-06-08  0:10     ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-08  0:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

On Wed, Jun 08, 2011 at 07:03:13AM +0800, Andrew Morton wrote:
> On Wed, 08 Jun 2011 05:32:39 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > The flusher works on dirty inodes in batches, and may quit prematurely
> > if the batch of inodes happen to be metadata-only dirtied: in this case
> > wbc->nr_to_write won't be decreased at all, which stands for "no pages
> > written" but also mis-interpreted as "no progress".
> > 
> > So introduce writeback_control.inodes_cleaned to count the inodes get
> > cleaned.  A non-zero value means there are some progress on writeback,
> > in which case more writeback can be tried.
> 
> Yes, that makes sense.  I had a workload which demonstrated/exploited
> this nine years ago but I never got around to fixing it, never told
> anyone and nobody noticed ;)

Good to know that :)

> > +	long inodes_cleaned;		/* # of inodes cleaned */
> 
> nanonit: I'd call this inodes_written, because they may not actually be
> clean.

Yeah, at least for ext4.

Done the rename. Also added a note that it's not the exact number of
inodes written.

Thanks,
Fengguang
---
Subject: writeback: introduce writeback_control.inodes_written
Date: Wed Jul 21 22:50:57 CST 2010

The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happen to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but also mis-interpreted as "no progress".

So introduce writeback_control.inodes_written to count the inodes that
get cleaned from the VFS point of view.  A non-zero value means there is
some progress on writeback, in which case more writeback can be tried.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    4 ++++
 include/linux/writeback.h |    1 +
 2 files changed, 5 insertions(+)

about v1: The initial version counted successful ->write_inode() calls.
However, that leads to busy loops for sync() over NFS, because NFS
ridiculously returns 0 (success) while at the same time redirtying the
inode.  The NFS case could be trivially fixed, but there may be more
hidden bugs in other filesystems.

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
@@ -464,6 +464,7 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
+			wbc->inodes_written++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -725,6 +726,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
+		wbc.inodes_written = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -741,6 +743,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write <= 0)
 			continue;
+		if (wbc.inodes_written)
+			continue;
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:14.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:16.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_written;		/* # of inodes written (at least) */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
  2011-06-07 23:03   ` Andrew Morton
@ 2011-06-08  0:20     ` Wu Fengguang
  2011-06-08  0:35       ` Andrew Morton
  0 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-06-08  0:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed, Jun 08, 2011 at 07:03:19AM +0800, Andrew Morton wrote:
> On Wed, 08 Jun 2011 05:32:43 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> >  static void bdev_inode_switch_bdi(struct inode *inode,
> >  			struct backing_dev_info *dst)
> >  {
> > -	spin_lock(&inode_wb_list_lock);
> > +	struct backing_dev_info *old = inode->i_data.backing_dev_info;
> > +
> > +	if (unlikely(dst == old))		/* deadlock avoidance */
> > +		return;
> 
> Why does this occur?

That's a fix from Hugh Dickins:

        Yesterday's mmotm hangs at startup, and with lockdep it reports:
        BUG: spinlock recursion on CPU#1, blkid/284 - with bdi_lock_two()
        called from bdev_inode_switch_bdi() in the backtrace.  It appears
        that this function is sometimes called with new the same as old.

The problem becomes clear when looking at bdi_lock_two(), which will
immediately deadlock itself if called with (wb1 == wb2): the else branch
then takes the same spinlock twice, and spin_lock_nested() only quiets
lockdep, it does not make the lock recursive:

  void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
  {
          if (wb1 < wb2) {
                  spin_lock(&wb1->list_lock);
                  spin_lock_nested(&wb2->list_lock, 1);
          } else {
                  spin_lock(&wb2->list_lock);
                  spin_lock_nested(&wb1->list_lock, 1);
          }
  }

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
  2011-06-08  0:20     ` Wu Fengguang
@ 2011-06-08  0:35       ` Andrew Morton
  2011-06-08  1:36         ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2011-06-08  0:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed, 8 Jun 2011 08:20:57 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Wed, Jun 08, 2011 at 07:03:19AM +0800, Andrew Morton wrote:
> > On Wed, 08 Jun 2011 05:32:43 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > >  static void bdev_inode_switch_bdi(struct inode *inode,
> > >  			struct backing_dev_info *dst)
> > >  {
> > > -	spin_lock(&inode_wb_list_lock);
> > > +	struct backing_dev_info *old = inode->i_data.backing_dev_info;
> > > +
> > > +	if (unlikely(dst == old))		/* deadlock avoidance */
> > > +		return;
> > 
> > Why does this occur?
> 
> That's a fix from Hugh Dickins:

yes, I remember it.  And I remember rubberiness about this at the time ;)

>         Yesterday's mmotm hangs at startup, and with lockdep it reports:
>         BUG: spinlock recursion on CPU#1, blkid/284 - with bdi_lock_two()
>         called from bdev_inode_switch_bdi() in the backtrace.  It appears
>         that this function is sometimes called with new the same as old.
> 
> The problem becomes clear when looking at bdi_lock_two(), which will
> immediately deadlock itself if called with (wb1 == wb2):
> 
>   void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
>   {
>           if (wb1 < wb2) {
>                   spin_lock(&wb1->list_lock);                                                                         
>                   spin_lock_nested(&wb2->list_lock, 1);
>           } else {
>                   spin_lock(&wb2->list_lock);
>                   spin_lock_nested(&wb1->list_lock, 1);
>           }
>   }

But why are we asking bdev_inode_switch_bdi() to switch an inode to a
bdi where it already resides?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
  2011-06-08  0:35       ` Andrew Morton
@ 2011-06-08  1:36         ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-08  1:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed, Jun 08, 2011 at 08:35:27AM +0800, Andrew Morton wrote:
> On Wed, 8 Jun 2011 08:20:57 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Wed, Jun 08, 2011 at 07:03:19AM +0800, Andrew Morton wrote:
> > > On Wed, 08 Jun 2011 05:32:43 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > >  static void bdev_inode_switch_bdi(struct inode *inode,
> > > >  			struct backing_dev_info *dst)
> > > >  {
> > > > -	spin_lock(&inode_wb_list_lock);
> > > > +	struct backing_dev_info *old = inode->i_data.backing_dev_info;
> > > > +
> > > > +	if (unlikely(dst == old))		/* deadlock avoidance */
> > > > +		return;
> > > 
> > > Why does this occur?
> > 
> > That's a fix from Hugh Dickins:
> 
> yes, I remember it.  And I remember rubberiness about this at the time ;)
> 
> >         Yesterday's mmotm hangs at startup, and with lockdep it reports:
> >         BUG: spinlock recursion on CPU#1, blkid/284 - with bdi_lock_two()
> >         called from bdev_inode_switch_bdi() in the backtrace.  It appears
> >         that this function is sometimes called with new the same as old.
> > 
> > The problem becomes clear when looking at bdi_lock_two(), which will
> > immediately deadlock itself if called with (wb1 == wb2):
> > 
> >   void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
> >   {
> >           if (wb1 < wb2) {
> >                   spin_lock(&wb1->list_lock);                                                                         
> >                   spin_lock_nested(&wb2->list_lock, 1);
> >           } else {
> >                   spin_lock(&wb2->list_lock);
> >                   spin_lock_nested(&wb1->list_lock, 1);
> >           }
> >   }
> 
> But why are we asking bdev_inode_switch_bdi() to switch an inode to a
> bdi where it already resides?

That's definitely an interesting problem.

I suspect it is some inode pointing to &default_backing_dev_info being
switched to the same &default_backing_dev_info, and I did manage to
catch one such case, called from __blkdev_get():

        1196  out_clear:
        1197         disk_put_part(bdev->bd_part);
        1198         bdev->bd_disk = NULL;
        1199         bdev->bd_part = NULL;
        1200         WARN_ON(bdev->bd_inode->i_data.backing_dev_info ==
        1201                 &default_backing_dev_info);
==>     1202         bdev_inode_switch_bdi(bdev->bd_inode, &default_backing_dev_info);
        1203         if (bdev != bdev->bd_contains)
        1204                 __blkdev_put(bdev->bd_contains, mode, 1);
        1205         bdev->bd_contains = NULL;

The debug call trace is:

[   88.751130] ------------[ cut here ]------------
[   88.751546] WARNING: at /c/wfg/linux-next/fs/block_dev.c:1201 __blkdev_get+0x38a/0x40a()
[   88.752201] Hardware name:
[   88.752554] Modules linked in:
[   88.752866] Pid: 3214, comm: blkid Not tainted 3.0.0-rc2-next-20110607+ #372
[   88.753354] Call Trace:
[   88.753610]  [<ffffffff810700e0>] warn_slowpath_common+0x85/0x9d
[   88.753987]  [<ffffffff81070112>] warn_slowpath_null+0x1a/0x1c
[   88.754428]  [<ffffffff8116c57e>] __blkdev_get+0x38a/0x40a
[   88.754798]  [<ffffffff8116c8e3>] ? blkdev_get+0x2e5/0x2e5
[   88.755238]  [<ffffffff8116c7cb>] blkdev_get+0x1cd/0x2e5
[   88.755622]  [<ffffffff8192817b>] ? _raw_spin_unlock+0x2b/0x2f
[   88.759131]  [<ffffffff8116c8e3>] ? blkdev_get+0x2e5/0x2e5
[   88.759527]  [<ffffffff8116c961>] blkdev_open+0x7e/0x82
[   88.759896]  [<ffffffff8113e84f>] __dentry_open+0x1c8/0x31d
[   88.760341]  [<ffffffff8192817b>] ? _raw_spin_unlock+0x2b/0x2f
[   88.760737]  [<ffffffff8113f65c>] nameidata_to_filp+0x48/0x4f
[   88.761126]  [<ffffffff8114bafa>] do_last+0x5c8/0x71f
[   88.761552]  [<ffffffff8114cd7b>] path_openat+0x29d/0x34f
[   88.761932]  [<ffffffff8114ce6a>] do_filp_open+0x3d/0x89
[   88.762367]  [<ffffffff8192817b>] ? _raw_spin_unlock+0x2b/0x2f
[   88.762765]  [<ffffffff811577a1>] ? alloc_fd+0x10b/0x11d
[   88.763200]  [<ffffffff8113f771>] do_sys_open+0x10e/0x1a0
[   88.763581]  [<ffffffff81111813>] ? __do_fault+0x29a/0x46e
[   88.763960]  [<ffffffff8113f823>] sys_open+0x20/0x22
[   88.764380]  [<ffffffff8192ed42>] system_call_fastpath+0x16/0x1b
[   88.764782] ---[ end trace 28100c425ce9e560 ]---


Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5)
  2011-06-07 23:04 ` [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5) Andrew Morton
@ 2011-06-08  2:01   ` Wu Fengguang
  2011-06-08  6:21     ` Sedat Dilek
  2011-06-08 13:45     ` Wu Fengguang
  0 siblings, 2 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-08  2:01 UTC (permalink / raw)
  To: Andrew Morton, Stephen Rothwell
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

On Wed, Jun 08, 2011 at 07:04:41AM +0800, Andrew Morton wrote:
> On Wed, 08 Jun 2011 05:32:36 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 
> > Andrew,
> > 
> > Here are the intented patches for upstream submission, rebased to v3.0-rc2:
> > 
> > 	[PATCH 01/15] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
> > 	[PATCH 02/15] writeback: update dirtied_when for synced inode to prevent livelock
> > 	[PATCH 03/15] writeback: introduce writeback_control.inodes_cleaned
> > 	[PATCH 04/15] writeback: try more writeback as long as something was written
> > 	[PATCH 05/15] writeback: the kupdate expire timestamp should be a moving target
> > 	[PATCH 06/15] writeback: refill b_io iff empty
> > 	[PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
> > 	[PATCH 08/15] writeback: elevate queue_io() into wb_writeback()
> > 	[PATCH 09/15] writeback: avoid extra sync work at enqueue time
> > 	[PATCH 10/15] writeback: add bdi_dirty_limit() kernel-doc
> > 	[PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs
> > 	[PATCH 12/15] writeback: remove writeback_control.more_io
> > 	[PATCH 13/15] writeback: remove .nonblocking and .encountered_congestion
> > 	[PATCH 14/15] writeback: trace event writeback_single_inode
> > 	[PATCH 15/15] writeback: trace event writeback_queue_io
> 
> Looks OK to me, although I think a few of the changelogs are lacking
> important details.

Thanks.

> > They are git pullable from
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback
> > 
> 
> Please ask Stephen to suck this into linux-next when ready.

OK, it now includes the updated patches. It's based on v3.0-rc2 and
also compiles/runs OK with 3.0.0-rc2-next-20110607.

Stephen, would you help add this git branch to linux-next? Thank you.

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git#fs-writeback

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5)
  2011-06-08  2:01   ` Wu Fengguang
@ 2011-06-08  6:21     ` Sedat Dilek
  2011-06-08 13:45     ` Wu Fengguang
  1 sibling, 0 replies; 42+ messages in thread
From: Sedat Dilek @ 2011-06-08  6:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Stephen Rothwell, Jan Kara, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed, Jun 8, 2011 at 4:01 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Wed, Jun 08, 2011 at 07:04:41AM +0800, Andrew Morton wrote:
>> On Wed, 08 Jun 2011 05:32:36 +0800
>> Wu Fengguang <fengguang.wu@intel.com> wrote:
>>
>> >
>> > Andrew,
>> >
>> > Here are the intented patches for upstream submission, rebased to v3.0-rc2:
>> >
>> >     [PATCH 01/15] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
>> >     [PATCH 02/15] writeback: update dirtied_when for synced inode to prevent livelock
>> >     [PATCH 03/15] writeback: introduce writeback_control.inodes_cleaned
>> >     [PATCH 04/15] writeback: try more writeback as long as something was written
>> >     [PATCH 05/15] writeback: the kupdate expire timestamp should be a moving target
>> >     [PATCH 06/15] writeback: refill b_io iff empty
>> >     [PATCH 07/15] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
>> >     [PATCH 08/15] writeback: elevate queue_io() into wb_writeback()
>> >     [PATCH 09/15] writeback: avoid extra sync work at enqueue time
>> >     [PATCH 10/15] writeback: add bdi_dirty_limit() kernel-doc
>> >     [PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs
>> >     [PATCH 12/15] writeback: remove writeback_control.more_io
>> >     [PATCH 13/15] writeback: remove .nonblocking and .encountered_congestion
>> >     [PATCH 14/15] writeback: trace event writeback_single_inode
>> >     [PATCH 15/15] writeback: trace event writeback_queue_io
>>
>> Looks OK to me, although I think a few of the changelogs are lacking
>> important details.
>
> Thanks.
>
>> > They are git pullable from
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback
>> >
>>
>> Please ask Stephen to suck this into linux-next when ready.
>
> OK, it now includes the updated patches. It's based on v3.0-rc2 and
> also compiles/runs OK with 3.0.0-rc2-next-20110607.
>

So did I, a few hours ago... I'm running a kernel with your patchset
applied right now.

- Sedat -

> Stephen, would you help add this git branch to linux-next? Thank you.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git#fs-writeback
>
> Thanks,
> Fengguang

* Re: [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5)
  2011-06-08  2:01   ` Wu Fengguang
  2011-06-08  6:21     ` Sedat Dilek
@ 2011-06-08 13:45     ` Wu Fengguang
  2011-06-09  1:16       ` Stephen Rothwell
  1 sibling, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-06-08 13:45 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML, sedat.dilek

Hi Stephen,

> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git#fs-writeback

Will you include this branch instead? It looks more reasonable as
suggested by Sedat:

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git  next

Thanks,
Fengguang

* Re: [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5)
  2011-06-08 13:45     ` Wu Fengguang
@ 2011-06-09  1:16       ` Stephen Rothwell
  2011-06-09  2:18         ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Stephen Rothwell @ 2011-06-09  1:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML, sedat.dilek

Hi,

On Wed, 8 Jun 2011 21:45:10 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git#fs-writeback
> 
> Will you include this branch instead? It looks more reasonable as
> suggested by Sedat:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git  next

I have added that tree starting today with just you as the contact
(please let me know if others should be added as contacts as well).

Thanks for adding your subsystem tree as a participant of linux-next.  As
you may know, this is not a judgment of your code.  The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window. 

You will need to ensure that the patches/commits in your tree/series have
been:
     * submitted under GPL v2 (or later) and include the Contributor's
	Signed-off-by,
     * posted to the relevant mailing list,
     * reviewed by you (or another maintainer of your subsystem tree),
     * successfully unit tested, and 
     * destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch).  It is allowed to be rebased if you deem it necessary.

-- 
Cheers,
Stephen Rothwell 
sfr@canb.auug.org.au

Legal Stuff:
By participating in linux-next, your subsystem tree contributions are
public and will be included in the linux-next trees.  You may be sent
e-mail messages indicating errors or other issues when the
patches/commits from your subsystem tree are merged and tested in
linux-next.  These messages may also be cross-posted to the linux-next
mailing list, the linux-kernel mailing list, etc.  The linux-next tree
project and IBM (my employer) make no warranties regarding the linux-next
project, the testing procedures, the results, the e-mails, etc.  If you
don't agree to these ground rules, let me know and I'll remove your tree
from participation in linux-next.

* Re: [PATCH 00/15] writeback fixes and cleanups for 3.0 (v5)
  2011-06-09  1:16       ` Stephen Rothwell
@ 2011-06-09  2:18         ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-06-09  2:18 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML, sedat.dilek

Hi Stephen,

On Thu, Jun 09, 2011 at 09:16:09AM +0800, Stephen Rothwell wrote:
> Hi,
> 
> On Wed, 8 Jun 2011 21:45:10 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git#fs-writeback
> > 
> > Will you include this branch instead? It looks more reasonable as
> > suggested by Sedat:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git  next
> 
> I have added that tree starting today with just you as the contact
> (please let me know if others should be added as contacts as well).

Thank you!

> Thanks for adding your subsystem tree as a participant of linux-next.  As
> you may know, this is not a judgment of your code.  The purpose of
> linux-next is for integration testing and to lower the impact of
> conflicts between subsystems in the next merge window. 
> 
> You will need to ensure that the patches/commits in your tree/series have
> been:
>      * submitted under GPL v2 (or later) and include the Contributor's
> 	Signed-off-by,
>      * posted to the relevant mailing list,
>      * reviewed by you (or another maintainer of your subsystem tree),
>      * successfully unit tested, and 
>      * destined for the current or next Linux merge window.
> 
> Basically, this should be just what you would send to Linus (or ask him
> to fetch).  It is allowed to be rebased if you deem it necessary.

OK.

Thanks,
Fengguang

* Re: [PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs
  2011-06-07 21:32 ` [PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
@ 2011-06-11 13:07   ` Wu Fengguang
  2011-06-13 13:42     ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-06-11 13:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

> @@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a
>  	bool dirty_exceeded = false;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
> +	if (!bdi_cap_account_dirty(bdi))
> +		return;
> +
>  	for (;;) {
>  		struct writeback_control wbc = {
>  			.sync_mode	= WB_SYNC_NONE,
> 

Update: it's slightly better to do the early return in
balance_dirty_pages_ratelimited_nr(). This way tmpfs dirtying won't
unnecessarily disturb the per-cpu bdp_ratelimits counters.

Thanks,
Fengguang
---

--- linux-next.orig/mm/page-writeback.c	2011-06-11 17:52:59.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-11 17:53:53.000000000 +0800
@@ -1098,9 +1098,6 @@ static void balance_dirty_pages(struct a
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
-	if (!bdi_cap_account_dirty(bdi))
-		return;
-
 	for (;;) {
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
@@ -1237,9 +1234,13 @@ static DEFINE_PER_CPU(unsigned long, bdp
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied)
 {
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long ratelimit;
 	unsigned long *p;
 
+	if (!bdi_cap_account_dirty(bdi))
+		return;
+
 	ratelimit = ratelimit_pages;
 	if (mapping->backing_dev_info->dirty_exceeded)
 		ratelimit = 8;
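
For reference, the capability test used in the hunk above is a one-line
check; a minimal sketch, assuming the include/linux/backing-dev.h
definition of that time:

static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
{
	/* in-memory filesystems like tmpfs set BDI_CAP_NO_ACCT_DIRTY,
	 * so they take the early return and skip dirty accounting */
	return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
}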

* Re: [PATCH 11/15] writeback: skip balance_dirty_pages() for in-memory fs
  2011-06-11 13:07   ` Wu Fengguang
@ 2011-06-13 13:42     ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2011-06-13 13:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Hugh Dickins, Rik van Riel,
	Peter Zijlstra, Dave Chinner, Christoph Hellwig, linux-fsdevel,
	LKML

On Sat 11-06-11 21:07:31, Wu Fengguang wrote:
> > @@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a
> >  	bool dirty_exceeded = false;
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  
> > +	if (!bdi_cap_account_dirty(bdi))
> > +		return;
> > +
> >  	for (;;) {
> >  		struct writeback_control wbc = {
> >  			.sync_mode	= WB_SYNC_NONE,
> > 
> 
> Update: it's slightly better to do the early return in
> balance_dirty_pages_ratelimited_nr(). This way tmpfs dirtying won't
> unnecessarily disturb the per-cpu bdp_ratelimits counters.
  Yup, makes sense to me.

Acked-by: Jan Kara <jack@suse.cz>

								Honza

> --- linux-next.orig/mm/page-writeback.c	2011-06-11 17:52:59.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-06-11 17:53:53.000000000 +0800
> @@ -1098,9 +1098,6 @@ static void balance_dirty_pages(struct a
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
>  
> -	if (!bdi_cap_account_dirty(bdi))
> -		return;
> -
>  	for (;;) {
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> @@ -1237,9 +1234,13 @@ static DEFINE_PER_CPU(unsigned long, bdp
>  void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
>  					unsigned long nr_pages_dirtied)
>  {
> +	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long ratelimit;
>  	unsigned long *p;
>  
> +	if (!bdi_cap_account_dirty(bdi))
> +		return;
> +
>  	ratelimit = ratelimit_pages;
>  	if (mapping->backing_dev_info->dirty_exceeded)
>  		ratelimit = 8;
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-06-07 21:32 ` [PATCH 12/15] writeback: remove writeback_control.more_io Wu Fengguang
@ 2011-07-11 21:31   ` Hugh Dickins
  2011-07-12  6:20     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Hugh Dickins @ 2011-07-11 21:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed, 8 Jun 2011, Wu Fengguang wrote:
> > When wbc.more_io was first introduced, it indicated whether there was
> > at least one superblock whose s_more_io list contained more IO work. Now
> > with per-bdi writeback, it can be replaced with a simple b_more_io test.

This commit, b7a2441f9966fe3e1be960a876ab52e6029ea005 in your branch
for linux-next, seems very reasonable to me.

But bisection, confirmed on x86_64 and ppc64 by patching the effective
(fs-writeback.c) mods into and out of mmotm with that patch reverted,
shows it to be responsible for freezes when running my kernel builds
on my ext2-on-loop-on-tmpfs swapping test.

flush-7:0 (which is doing writeback to the ext2 filesystem on loop0 on
a 450MB tmpfs file, though I'm using the ext4 driver to run that ext2fs)
seems to get stuck circling around __writeback_inodes_wb(), called from
wb_writeback() from wb_do_writeback() from bdi_writeback_thread().

Other tasks then hang trying to get the spinlock in inode_wb_list_del()
(memory pressure is trying to evict inodes) or __mark_inode_dirty().

I spent a little while trying to understand why,
but couldn't work it out: hope you can do better!
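
For reference, the eviction-side helper those tasks hang in is tiny; it
just needs the same lock the flusher is holding. A minimal sketch,
assuming the post-list_lock-split code of this series:

void inode_wb_list_del(struct inode *inode)
{
	struct backing_dev_info *bdi = inode_to_bdi(inode);

	spin_lock(&bdi->wb.list_lock);	/* waits behind the flusher */
	list_del_init(&inode->i_wb_list);
	spin_unlock(&bdi->wb.list_lock);
}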

You may want to ask me to try it with ext4 on disk without /dev/loop0
and tmpfs: I can try that later, though it may take me a while to tune
it to produce equivalent memory pressure.

Hugh

> 
> Acked-by: Jan Kara <jack@suse.cz>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |    9 ++-------
>  include/linux/writeback.h        |    1 -
>  include/trace/events/ext4.h      |    6 ++----
>  include/trace/events/writeback.h |    5 +----
>  4 files changed, 5 insertions(+), 16 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-06-08 05:16:56.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-06-08 05:17:13.000000000 +0800
> @@ -560,12 +560,8 @@ static int writeback_sb_inodes(struct su
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&wb->list_lock);
> -		if (wbc->nr_to_write <= 0) {
> -			wbc->more_io = 1;
> +		if (wbc->nr_to_write <= 0)
>  			return 1;
> -		}
> -		if (!list_empty(&wb->b_more_io))
> -			wbc->more_io = 1;
>  	}
>  	/* b_io is empty */
>  	return 1;
> @@ -708,7 +704,6 @@ static long wb_writeback(struct bdi_writ
>  			wbc.older_than_this = &oldest_jif;
>  		}
>  
> -		wbc.more_io = 0;
>  		wbc.nr_to_write = write_chunk;
>  		wbc.pages_skipped = 0;
>  		wbc.inodes_cleaned = 0;
> @@ -740,7 +735,7 @@ static long wb_writeback(struct bdi_writ
>  		/*
>  		 * No more inodes for IO, bail
>  		 */
> -		if (!wbc.more_io)
> +		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
>  		 * Nothing written. Wait for some inode to

[ Deleted the rest which is just about removing the more_io bitfield ]

Hugh

* Re: [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-07-11 21:31   ` Hugh Dickins
@ 2011-07-12  6:20     ` Wu Fengguang
  2011-07-12 19:50       ` Hugh Dickins
  0 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-07-12  6:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

Hi Hugh,

On Tue, Jul 12, 2011 at 05:31:50AM +0800, Hugh Dickins wrote:
> On Wed, 8 Jun 2011, Wu Fengguang wrote:
> > When wbc.more_io was first introduced, it indicated whether there was
> > at least one superblock whose s_more_io list contained more IO work. Now
> > with per-bdi writeback, it can be replaced with a simple b_more_io test.
> 
> This commit, b7a2441f9966fe3e1be960a876ab52e6029ea005 in your branch
> for linux-next, seems very reasonable to me.
> 
> But bisection, confirmed on x86_64 and ppc64 by patching the effective
> (fs-writeback.c) mods into and out of mmotm with that patch reverted,
> shows it to be responsible for freezes when running my kernel builds
> on my ext2-on-loop-on-tmpfs swapping test.
> 
> flush-7:0 (which is doing writeback to the ext2 filesystem on loop0 on
> a 450MB tmpfs file, though I'm using the ext4 driver to run that ext2fs)
> seems to get stuck circling around __writeback_inodes_wb(), called from
> wb_writeback() from wb_do_writeback() from bdi_writeback_thread().
> 
> Other tasks then hang trying to get the spinlock in inode_wb_list_del()
> (memory pressure is trying to evict inodes) or __mark_inode_dirty().

I created the ext2-on-tmpfs loop file and did some simple file copies,
however I cannot reproduce the problem. It would help if you happen to
have some usable test scripts. Or may I ask for your help to follow the
analysis below and perhaps some tracing efforts?

> I spent a little while trying to understand why,
> but couldn't work it out: hope you can do better!

In theory the patch only makes a difference in this case in
writeback_sb_inodes():

                if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                        spin_unlock(&inode->i_lock);
                        requeue_io(inode, wb);
                        continue;
                }

So if some inode is stuck in the I_NEW, I_FREEING or I_WILL_FREE state,
the flusher will get stuck busy retrying that inode.

It's relatively easy to confirm, by reusing the below trace event to
show the inode (together with its state) being requeued.

If this is the root cause, it may equally be fixed by

-			requeue_io(inode, wb);
+			redirty_tail(inode, wb);

which would be useful in case the bug is so deadly that it's no longer
possible to do tracing.
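
For reference, the two queueing helpers differ only in which list the
inode lands on; a minimal sketch approximating the fs/fs-writeback.c
helpers of this series:

/* requeue for another pass in the *current* writeback round; this is
 * what turns a permanently skipped inode into a busy loop */
static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
{
	list_move(&inode->i_wb_list, &wb->b_more_io);
}

/* put the inode back on b_dirty with a fresh enough dirtied_when, so
 * it is only reconsidered in a later writeback round */
static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail = wb_inode(wb->b_dirty.next);

		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_wb_list, &wb->b_dirty);
}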

Thanks,
Fengguang
---

echo 1 > /debug/tracing/events/writeback/writeback_single_inode*


--- linux-next.orig/fs/fs-writeback.c	2011-07-11 23:07:04.000000000 -0700
+++ linux-next/fs/fs-writeback.c	2011-07-11 23:08:45.000000000 -0700
@@ -726,6 +726,7 @@ static long writeback_sb_inodes(struct s
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
 			requeue_io(inode, wb);
+			trace_writeback_single_inode_requeue(inode, &wbc, 0);
 			continue;
 		}
 		__iget(inode);

* Re: [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-07-12  6:20     ` Wu Fengguang
@ 2011-07-12 19:50       ` Hugh Dickins
  2011-07-13  5:49         ` Hugh Dickins
  2011-07-13 22:07         ` Wu Fengguang
  0 siblings, 2 replies; 42+ messages in thread
From: Hugh Dickins @ 2011-07-12 19:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

On Mon, 11 Jul 2011, Wu Fengguang wrote:
> On Tue, Jul 12, 2011 at 05:31:50AM +0800, Hugh Dickins wrote:
> > On Wed, 8 Jun 2011, Wu Fengguang wrote:
> > > When wbc.more_io was first introduced, it indicated whether there was
> > > at least one superblock whose s_more_io list contained more IO work. Now
> > > with per-bdi writeback, it can be replaced with a simple b_more_io test.
> > 
> > This commit, b7a2441f9966fe3e1be960a876ab52e6029ea005 in your branch
> > for linux-next, seems very reasonable to me.
> > 
> > But bisection, confirmed on x86_64 and ppc64 by patching the effective
> > (fs-writeback.c) mods into and out of mmotm with that patch reverted,
> > shows it to be responsible for freezes when running my kernel builds
> > on my ext2-on-loop-on-tmpfs swapping test.
> > 
> > flush-7:0 (which is doing writeback to the ext2 filesystem on loop0 on
> > a 450MB tmpfs file, though I'm using the ext4 driver to run that ext2fs)
> > seems to get stuck circling around __writeback_inodes_wb(), called from
> > wb_writeback() from wb_do_writeback() from bdi_writeback_thread().
> > 
> > Other tasks then hang trying to get the spinlock in inode_wb_list_del()
> > (memory pressure is trying to evict inodes) or __mark_inode_dirty().
> 
> I created the ext2-on-tmpfs loop file and did some simple file copies,
> however I cannot reproduce the problem. It would help if you happen to
> have some usable test scripts.

It takes a while to explain: the sizes need to be tuned to get enough
memory pressure.  I'll forward you the mail in which I described it two
years ago, which should be enough to give you the idea, even if it's not
identical to what I'm using now.

But from what you say below, I think it's pretty much all (the ext2,
the loop, the tmpfs) irrelevant: memory pressure and lots of files
modified, a kernel build in limited memory, that should be enough.

> Or may I ask for your help to follow the
> analysis below and perhaps some tracing efforts?
> 
> > I spent a little while trying to understand why,
> > but couldn't work it out: hope you can do better!
> 
> In theory the patch only makes a difference in this case in
> writeback_sb_inodes():
> 
>                 if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>                         spin_unlock(&inode->i_lock);
>                         requeue_io(inode, wb);
>                         continue;
>                 }
> 
> So if some inode is stuck in the I_NEW, I_FREEING or I_WILL_FREE state,
> the flusher will get stuck busy retrying that inode.

Well, if that's the case (it's not obvious to me how removing more_io
made a difference there) then I think you've already got the answer:
thank you!  As I mentioned, memory pressure is trying to evict inodes, so
tasks are hanging trying to acquire wb->list_lock in inode_wb_list_del().

But inodes being evicted are I_FREEING, and the flush thread is
holding wb->list_lock across all of this (since one of your earlier
patchs), so if one of the inodes being evicted is on the b_io list,
then indeed we'll be stuck.

> 
> It's relatively easy to confirm, by reusing the below trace event to
> show the inode (together with its state) being requeued.
> 
> If this is the root cause, it may equally be fixed by
> 
> -			requeue_io(inode, wb);
> +			redirty_tail(inode, wb);
> 
> which would be useful in case the bug is so deadly that it's no longer
> possible to do tracing.

I checked again this morning that I could reproduce it on two machines,
one went in a few minutes, the other within the hour.  Then I made that
patch changing the requeue_io to redirty_tail, and left home with them
running the test with the new kernel: we'll see at the end of the day
how they fared.

I do have a variant of kdb built in (but I think my breakpoints are broken at
present), so I'd be inclined to use that rather than get into tracing:
I can easily add a counter in there and see what happens to it, if more
investigation turns out to be needed.

redirty_tail() instead of requeue_io(): is there any danger that doing
that could delay some updates indefinitely?

Hugh

> 
> Thanks,
> Fengguang
> ---
> 
> echo 1 > /debug/tracing/events/writeback/writeback_single_inode*
> 
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-07-11 23:07:04.000000000 -0700
> +++ linux-next/fs/fs-writeback.c	2011-07-11 23:08:45.000000000 -0700
> @@ -726,6 +726,7 @@ static long writeback_sb_inodes(struct s
>  		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>  			spin_unlock(&inode->i_lock);
>  			requeue_io(inode, wb);
> +			trace_writeback_single_inode_requeue(inode, &wbc, 0);
>  			continue;
>  		}
>  		__iget(inode);

* Re: [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-07-12 19:50       ` Hugh Dickins
@ 2011-07-13  5:49         ` Hugh Dickins
  2011-07-13 10:57           ` Hugh Dickins
  2011-07-13 22:07         ` Wu Fengguang
  1 sibling, 1 reply; 42+ messages in thread
From: Hugh Dickins @ 2011-07-13  5:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Wu Fengguang, Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, Christoph Lameter, Pekka Enberg,
	linux-fsdevel, linux-kernel

On Tue, 12 Jul 2011, Hugh Dickins wrote:
> On Mon, 11 Jul 2011, Wu Fengguang wrote:
> > 
> > It's relatively easy to confirm, by reusing the below trace event to
> > show the inode (together with its state) being requeued.
> > 
> > If this is the root cause, it may equally be fixed by
> > 
> > -			requeue_io(inode, wb);
> > +			redirty_tail(inode, wb);
> > 
> > which would be useful in case the bug is so deadly that it's no longer
> > possible to do tracing.
> 
> I checked again this morning that I could reproduce it on two machines,
> one went in a few minutes, the other within the hour.  Then I made that
> patch changing the requeue_io to redirty_tail, and left home with them
> running the test with the new kernel: we'll see at the end of the day
> how they fared.

I think that fixes it.  The x86_64 is still running with that, but the
ppc64 gave up fairly early, hitting freeze in __slab_free() instead.

I've now, I believe, reconstituted what ChristophL intended from the
mm_types.h struct page patch he posted (which applied neither to mmotm,
nor to Pekka's for-next, so far as I could tell: maybe cl did some
intermediate tidying of some of the random indentation).  So now
testing that with redirty_tail on ppc64: will report in 9 hours.

Hugh

* Re: [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-07-13  5:49         ` Hugh Dickins
@ 2011-07-13 10:57           ` Hugh Dickins
  2011-07-13 11:19             ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Hugh Dickins @ 2011-07-13 10:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, Christoph Lameter, Pekka Enberg,
	linux-fsdevel, linux-kernel

On Tue, 12 Jul 2011, Hugh Dickins wrote:
> On Tue, 12 Jul 2011, Hugh Dickins wrote:
> > On Mon, 11 Jul 2011, Wu Fengguang wrote:
> > > 
> > > It's relatively easy to confirm, by reusing the below trace event to
> > > show the inode (together with its state) being requeued.
> > > 
> > > If this is the root cause, it may equally be fixed by
> > > 
> > > -			requeue_io(inode, wb);
> > > +			redirty_tail(inode, wb);
> > > 
> > > which would be useful in case the bug is so deadly that it's no longer
> > > possible to do tracing.
> > 
> > I checked again this morning that I could reproduce it on two machines,
> > one went in a few minutes, the other within the hour.  Then I made that
> > patch changing the requeue_io to redirty_tail, and left home with them
> > running the test with the new kernel: we'll see at the end of the day
> > how they fared.
> 
> I think that fixes it.  The x86_64 is still running with that, but the
> ppc64 gave up fairly early, hitting freeze in __slab_free() instead.
> 
> I've now, I believe, reconstituted what ChristophL intended from the
> mm_types.h struct page patch he posted (which applied neither to mmotm,
> nor to Pekka's for-next, so far as I could tell: maybe cl did some
> intermediate tidying of some of the random indentation).  So now
> testing that with redirty_tail on ppc64: will report in 9 hours.

Same result as before.  The x86_64 is still going fine, but the ppc64
again seized up in __slab_free() after two and a half hours of load.

I think we should assume that your -requeue_io +redirty_tail is a good
fix for the writeback freeze (if you can reassure us that it does not
risk postponing some writes indefinitely), and I'll move over to the other
thread to pursue the struct page __slab_free() freeze.

Thanks!
Hugh

* Re: [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-07-13 10:57           ` Hugh Dickins
@ 2011-07-13 11:19             ` Jan Kara
  2011-07-13 15:06               ` Hugh Dickins
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2011-07-13 11:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Wu Fengguang, Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, Christoph Lameter, Pekka Enberg,
	linux-fsdevel, linux-kernel

On Wed 13-07-11 03:57:40, Hugh Dickins wrote:
> On Tue, 12 Jul 2011, Hugh Dickins wrote:
> > On Tue, 12 Jul 2011, Hugh Dickins wrote:
> > > On Mon, 11 Jul 2011, Wu Fengguang wrote:
> > > > 
> > > > It's relatively easy to confirm, by reusing the below trace event to
> > > > show the inode (together with its state) being requeued.
> > > > 
> > > > If this is the root cause, it may equally be fixed by
> > > > 
> > > > -			requeue_io(inode, wb);
> > > > +			redirty_tail(inode, wb);
> > > > 
> > > > which would be useful in case the bug is so deadly that it's no longer
> > > > possible to do tracing.
> > > 
> > > I checked again this morning that I could reproduce it on two machines,
> > > one went in a few minutes, the other within the hour.  Then I made that
> > > patch changing the requeue_io to redirty_tail, and left home with them
> > > running the test with the new kernel: we'll see at the end of the day
> > > how they fared.
> > 
> > I think that fixes it.  The x86_64 is still running with that, but the
> > ppc64 gave up fairly early, hitting freeze in __slab_free() instead.
> > 
> > I've now, I believe, reconstituted what ChristophL intended from the
> > mm_types.h struct page patch he posted (which applied neither to mmotm,
> > nor to Pekka's for-next, so far as I could tell: maybe cl did some
> > intermediate tidying of some of the random indentation).  So now
> > testing that with redirty_tail on ppc64: will report in 9 hours.
> 
> Same result as before.  The x86_64 is still going fine, but the ppc64
> again seized up in __slab_free() after two and a half hours of load.
> 
> I think we should assume that your -requeue_io +redirty_tail is a good
> fix for the writeback freeze (if you can reassure us that it does not
> risk postponing some writes indefinitely), and I'll move over to the other
> thread to pursue the struct page __slab_free() freeze.
  Well, I_FREEING or I_WILL_FREE inodes are written back by iput_final()
and it is reclaim code that is responsible for eventually removing them. So
writeback code can safely ignore them. I_NEW inodes should move out of this
state when they are fully set up and in the writeback round following that,
we will consider them for writeback. So I think the change really makes
sense.
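
For reference, the iput_final() path Jan describes has roughly this
shape (a simplified sketch of the fs/inode.c logic of that era, not a
verbatim quote):

	inode->i_state |= I_WILL_FREE;
	spin_unlock(&inode->i_lock);
	write_inode_now(inode, 1);	/* the freer writes it back itself */
	spin_lock(&inode->i_lock);
	inode->i_state &= ~I_WILL_FREE;
	inode->i_state |= I_FREEING;
	spin_unlock(&inode->i_lock);
	evict(inode);			/* reclaim removes it for good */

so the flusher can indeed leave such inodes alone.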

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-07-13 11:19             ` Jan Kara
@ 2011-07-13 15:06               ` Hugh Dickins
  0 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2011-07-13 15:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Andrew Morton, Mel Gorman, Dave Chinner,
	Christoph Hellwig, Christoph Lameter, Pekka Enberg,
	linux-fsdevel, linux-kernel

On Wed, 13 Jul 2011, Jan Kara wrote:
> On Wed 13-07-11 03:57:40, Hugh Dickins wrote:
> > 
> > I think we should assume that your -requeue_io +redirty_tail is a good
> > fix for the writeback freeze (if you can reassure us that it does not
> > risk postponing some writes indefinitely)
>   Well, I_FREEING or I_WILL_FREE inodes are written back by iput_final()
> and it is reclaim code that is responsible for eventually removing them. So
> writeback code can safely ignore them. I_NEW inodes should move out of this
> state when they are fully set up and in the writeback round following that,
> we will consider them for writeback. So I think the change really makes
> sense.

That fits.  Good, thanks for the reassurance.

Hugh

* Re: [PATCH 12/15] writeback: remove writeback_control.more_io
  2011-07-12 19:50       ` Hugh Dickins
  2011-07-13  5:49         ` Hugh Dickins
@ 2011-07-13 22:07         ` Wu Fengguang
  1 sibling, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-07-13 22:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed, Jul 13, 2011 at 03:50:31AM +0800, Hugh Dickins wrote:
> On Mon, 11 Jul 2011, Wu Fengguang wrote:
> > On Tue, Jul 12, 2011 at 05:31:50AM +0800, Hugh Dickins wrote:
> > > On Wed, 8 Jun 2011, Wu Fengguang wrote:
> > > > When wbc.more_io was first introduced, it indicated whether there was
> > > > at least one superblock whose s_more_io list contained more IO work. Now
> > > > with per-bdi writeback, it can be replaced with a simple b_more_io test.
> > > 
> > > This commit, b7a2441f9966fe3e1be960a876ab52e6029ea005 in your branch
> > > for linux-next, seems very reasonable to me.
> > > 
> > > But bisection, confirmed on x86_64 and ppc64 by patching the effective
> > > (fs-writeback.c) mods into and out of mmotm with that patch reverted,
> > > shows it to be responsible for freezes when running my kernel builds
> > > on my ext2-on-loop-on-tmpfs swapping test.
> > > 
> > > flush-7:0 (which is doing writeback to the ext2 filesystem on loop0 on
> > > a 450MB tmpfs file, though I'm using the ext4 driver to run that ext2fs)
> > > seems to get stuck circling around __writeback_inodes_wb(), called from
> > > wb_writeback() from wb_do_writeback() from bdi_writeback_thread().
> > > 
> > > Other tasks then hang trying to get the spinlock in inode_wb_list_del()
> > > (memory pressure is trying to evict inodes) or __mark_inode_dirty().
> > 
> > I created the ext2-on-tmpfs loop file and did some simple file copies,
> > however I cannot reproduce the problem. It would help if you happen to
> > have some usable test scripts.
> 
> It takes a while to explain: the sizes need to be tuned to get enough
> memory pressure.  I'll forward you the mail in which I described it two
> years ago, which should be enough to give you the idea, even if it's not
> identical to what I'm using now.

Thank you. I'm able to reproduce the bug with your script. I tried 2
runs on unmodified linux-next and they both hung the test box. Then I
applied the redirty_tail() patch below and it survived 2 new runs.

Combined with your test results, we can safely assume that the bug
is fixed by changing to redirty_tail().

> But from what you say below, I think it's pretty much all (the ext2,
> the loop, the tmpfs) irrelevant: memory pressure and lots of files
> modified, a kernel build in limited memory, that should be enough.

Yeah, that's probably enough to trigger an I_FREEING|I_WILL_FREE dirty
inode that is caught by the flusher thread.

> > Or may I ask for your help to follow the
> > analysis below and perhaps some tracing efforts?
> > 
> > > I spent a little while trying to understand why,
> > > but couldn't work it out: hope you can do better!
> > 
> > In theory the patch only makes a difference in this case in
> > writeback_sb_inodes():
> > 
> >                 if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> >                         spin_unlock(&inode->i_lock);
> >                         requeue_io(inode, wb);
> >                         continue;
> >                 }
> > 
> > So if some inode is stuck in the I_NEW, I_FREEING or I_WILL_FREE state,
> > the flusher will get stuck busy retrying that inode.
> 
> Well, if that's the case (it's not obvious to me how removing more_io
> made a difference there) then I think you've already got the answer:
> thank you!  As I mentioned, memory pressure is trying to evict inodes, so
> tasks are hanging trying to acquire wb->list_lock in inode_wb_list_del().
>
> But inodes being evicted are I_FREEING, and the flush thread is
> holding wb->list_lock across all of this (since one of your earlier
> patches), so if one of the inodes being evicted is on the b_io list,
> then indeed we'll be stuck.

Exactly. Commit e8dfc305 ("writeback: elevate queue_io() into
wb_writeback()") aims to be a cleanup patch, but happens to have the
side effect of holding wb->list_lock throughout the loop in the case of
an I_NEW|I_FREEING|I_WILL_FREE inode. Without it, only the flusher
thread would busy loop; other tasks would still have a chance to pass
through the lock.
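
To make the failure mode concrete, the loop has roughly this shape
(a schematic sketch of the post-e8dfc305 logic, not the literal code):

	spin_lock(&wb->list_lock);
	for (;;) {
		if (list_empty(&wb->b_io))
			queue_io(wb, wbc.older_than_this);
		writeback_sb_inodes(sb, wb, &wbc);
			/* sees the I_FREEING inode and requeue_io()s it
			 * straight back onto b_more_io */
		if (list_empty(&wb->b_more_io))
			break;	/* never taken: b_more_io stays non-empty */
	}
	spin_unlock(&wb->list_lock);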

> > 
> > It's relatively easy to confirm, by reusing the below trace event to
> > show the inode (together with its state) being requeued.
> > 
> > If this is the root cause, it may equally be fixed by
> > 
> > -			requeue_io(inode, wb);
> > +			redirty_tail(inode, wb);
> > 
> > which would be useful in case the bug is so deadly that it's no longer
> > possible to do tracing.
> 
> I checked again this morning that I could reproduce it on two machines,
> one went in a few minutes, the other within the hour.  Then I made that
> patch changing the requeue_io to redirty_tail, and left home with them
> running the test with the new kernel: we'll see at the end of the day
> how they fared.

Good, our test results agree up to now :)

> I do have a variant of kdb built in (but I think my breakpoints are broken at
> present), so I'd be inclined to use that rather than get into tracing:
> I can easily add a counter in there and see what happens to it, if more
> investigation turns out to be needed.

I also found it impossible to carry out the tracing.

> redirty_tail() instead of requeue_io(): is there any danger that doing
> that could delay some updates indefinitely?

As explained by Jan, I_FREEING|I_WILL_FREE inodes will be written out by
the freer, and I_NEW is a transitional state. So it looks like a safe fix.

Thanks,
Fengguang
---
Subject: writeback: don't busy retry writeback on new/freeing inodes
Date: Mon Jul 11 23:08:50 PDT 2011

Fix a system hang bug introduced by commits b7a2441f9966 ("writeback:
remove writeback_control.more_io") and e8dfc3058 ("writeback: elevate
queue_io() into wb_writeback()"), easily reproducible with high memory
pressure and lots of file creations/deletions, for example a kernel
build in limited memory.

The system hangs when some inode is in the I_NEW, I_FREEING or
I_WILL_FREE state: the flusher gets stuck busy retrying that inode,
never releasing wb->list_lock. The held lock in turn blocks all kinds
of other tasks trying to grab it.

As put by Jan, it's a safe change regarding data integrity. I_FREEING or
I_WILL_FREE inodes are written back by iput_final() and it is reclaim
code that is responsible for eventually removing them. So writeback code
can safely ignore them. I_NEW inodes should move out of this state when
they are fully set up and in the writeback round following that, we will
consider them for writeback. So the change makes sense.

CC: Jan Kara <jack@suse.cz> 
Reported-by: Hugh Dickins <hughd@google.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/fs/fs-writeback.c	2011-07-12 10:06:45.000000000 -0700
+++ linux-next/fs/fs-writeback.c	2011-07-13 14:41:39.000000000 -0700
@@ -725,7 +725,7 @@ static long writeback_sb_inodes(struct s
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-			requeue_io(inode, wb);
+			redirty_tail(inode, wb);
 			continue;
 		}
 		__iget(inode);
