* [PATCH 00/17] writeback fixes and cleanups for 2.6.40
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

Andrew,

This is the combination of all the recent writeback patches that have been
reasonably reviewed and tested.

The first 10 patches are already in the -mm tree, with these updates:

- remove "writeback: pass writeback_control down to move_expired_inodes()", and
  resolve the resulting merge conflicts in other patches.
- move ahead the sync livelock prevention patches (01, 02) so that (04) won't livelock sync
- merge the three -mm fixes to (08)
- fixed changelog of (01)

	[PATCH 01/17] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage
	[PATCH 02/17] writeback: update dirtied_when for synced inode to prevent livelock
	[PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned
	[PATCH 04/17] writeback: try more writeback as long as something was written
	[PATCH 05/17] writeback: the kupdate expire timestamp should be a moving target
	[PATCH 06/17] writeback: sync expired inodes first in background writeback
	[PATCH 07/17] writeback: refill b_io iff empty
	[PATCH 08/17] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
	[PATCH 09/17] writeback: elevate queue_io() into wb_writeback()
	[PATCH 10/17] writeback: avoid extra sync work at enqueue time

The following 7 patches were posted and reviewed recently:

	[PATCH 11/17] writeback: add bdi_dirty_limit() kernel-doc
	[PATCH 12/17] writeback: skip balance_dirty_pages() for in-memory fs
	[PATCH 13/17] writeback: remove writeback_control.more_io
	[PATCH 14/17] writeback: make writeback_control.nr_to_write straight
	[PATCH 15/17] writeback: remove .nonblocking and .encountered_congestion
	[PATCH 16/17] writeback: trace event writeback_single_inode
	[PATCH 17/17] writeback: trace event writeback_queue_io


The patches are also git pullable from

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback

Thanks,
Fengguang



* [PATCH 01/17] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: writeback-for-sync.patch --]
[-- Type: text/plain, Size: 4341 bytes --]

sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
livelock prevention for it, too.

Note that writeback_inodes_sb() is not called only by sync(); all of its
callers are treated the same because they also need livelock prevention.

Impact:  It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, writeback won't proceed to the next inode
until it has finished with the current one.
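
For reference, a minimal sketch of the two-stage sync(2) flow being tagged
here (paraphrased from the fs/sync.c call chain; quota and superblock
details are omitted):

	/* per-superblock work done by sync(2), heavily simplified */
	static void sync_one_sb_sketch(struct super_block *sb)
	{
		/* stage 1: WB_SYNC_NONE - kick off async writeout;
		 * with this patch it runs with wbc.tagged_sync = 1 */
		writeback_inodes_sb(sb);
		/* stage 2: WB_SYNC_ALL - write and wait on everything
		 * that was dirty when sync(2) was entered */
		sync_inodes_sb(sb);
	}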

CC: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/ext4/inode.c           |    4 ++--
 fs/fs-writeback.c         |    9 +++++----
 include/linux/writeback.h |    1 +
 mm/page-writeback.c       |    4 ++--
 4 files changed, 10 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:29:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:22.000000000 +0800
@@ -36,6 +36,7 @@ struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	enum writeback_sync_modes sync_mode;
+	unsigned int tagged_sync:1;
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
@@ -650,6 +651,7 @@ static long wb_writeback(struct bdi_writ
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
+		.tagged_sync		= work->tagged_sync,
 		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
@@ -657,7 +659,7 @@ static long wb_writeback(struct bdi_writ
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
-	long write_chunk;
+	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 
 	if (wbc.for_kupdate) {
@@ -683,9 +685,7 @@ static long wb_writeback(struct bdi_writ
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
 	 */
-	if (wbc.sync_mode == WB_SYNC_NONE)
-		write_chunk = MAX_WRITEBACK_PAGES;
-	else
+	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
@@ -1193,6 +1193,7 @@ void writeback_inodes_sb_nr(struct super
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_NONE,
+		.tagged_sync	= 1,
 		.done		= &done,
 		.nr_pages	= nr,
 	};
--- linux-next.orig/include/linux/writeback.h	2011-05-05 23:29:51.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-05 23:30:22.000000000 +0800
@@ -47,6 +47,7 @@ struct writeback_control {
 	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
+	unsigned tagged_sync:1;		/* do livelock prevention for sync */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned more_io:1;		/* more io to be dispatched */
--- linux-next.orig/mm/page-writeback.c	2011-05-05 23:29:51.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-05 23:30:22.000000000 +0800
@@ -892,12 +892,12 @@ int write_cache_pages(struct address_spa
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
--- linux-next.orig/fs/ext4/inode.c	2011-05-05 23:29:51.000000000 +0800
+++ linux-next/fs/ext4/inode.c	2011-05-05 23:30:22.000000000 +0800
@@ -2741,7 +2741,7 @@ static int write_cache_pages_da(struct a
 	index = wbc->range_start >> PAGE_CACHE_SHIFT;
 	end = wbc->range_end >> PAGE_CACHE_SHIFT;
 
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
@@ -2975,7 +2975,7 @@ static int ext4_da_writepages(struct add
 	}
 
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
 		tag_pages_for_writeback(mapping, index, end);
 
 	while (!ret && wbc->nr_to_write > 0) {




* [PATCH 02/17] writeback: update dirtied_when for synced inode to prevent livelock
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: writeback-fix-sync-busyloop.patch --]
[-- Type: text/plain, Size: 3615 bytes --]

With the more aggressive "keep writeback as long as we wrote something"
logic in wb_writeback(), the "use LONG_MAX .nr_to_write" trick in commit
b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") is
no longer enough to stop sync livelock.

The fix is to explicitly update .dirtied_when on synced inodes, so that
they are no longer considered for writeback in the next round.
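
To illustrate with a sketch (not the patch text itself): b_dirty is kept
sorted by ->dirtied_when, and enqueueing only picks inodes dirtied before
the sync started. Refreshing the stamp therefore moves a just-synced but
still-dirty inode out of this pass's working set:

	/* in writeback_single_inode(), for a tagged/WB_SYNC_ALL pass
	 * that leaves the inode dirty (sketch): */
	inode->dirtied_when = jiffies;	/* pretend "dirtied just now" */
	redirty_tail(inode);		/* young end of b_dirty: skipped by
					 * this sync, caught by the next */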

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    9 +++++++++
 1 file changed, 9 insertions(+)

ext3/ext4 are working fine now; however, tests show that XFS may still
livelock inside its own routines:

[ 3581.181253] sync            D ffff8800b6ca15d8  4560  4403   4392 0x00000000
[ 3581.181734]  ffff88006f775bc8 0000000000000046 ffff8800b6ca12b8 00000001b6ca1938
[ 3581.182411]  ffff88006f774000 00000000001d2e40 00000000001d2e40 ffff8800b6ca1280
[ 3581.183088]  00000000001d2e40 ffff88006f775fd8 00000340af111ef2 00000000001d2e40
[ 3581.183765] Call Trace:
[ 3581.184008]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.184392]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.184756]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.185120]  [<ffffffff812ed520>] xfs_ioend_wait+0x87/0x9f
[ 3581.185474]  [<ffffffff8108c97a>] ? wake_up_bit+0x2a/0x2a
[ 3581.185827]  [<ffffffff812f742a>] xfs_sync_inode_data+0x92/0x9d
[ 3581.186198]  [<ffffffff812f76e2>] xfs_inode_ag_walk+0x1a5/0x287
[ 3581.186569]  [<ffffffff812f779b>] ? xfs_inode_ag_walk+0x25e/0x287
[ 3581.186946]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.187311]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.187669]  [<ffffffff81092175>] ? local_clock+0x41/0x5a
[ 3581.188020]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.188403]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.188773]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.189130]  [<ffffffff812e236c>] ? xfs_perag_get+0x80/0xd0
[ 3581.189488]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.189858]  [<ffffffff812f7831>] ? xfs_inode_ag_iterator+0x6d/0x8f
[ 3581.190241]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.190606]  [<ffffffff812f780b>] xfs_inode_ag_iterator+0x47/0x8f
[ 3581.190982]  [<ffffffff811611f5>] ? __sync_filesystem+0x7a/0x7a
[ 3581.191352]  [<ffffffff812f7877>] xfs_sync_data+0x24/0x43
[ 3581.191703]  [<ffffffff812f7911>] xfs_quiesce_data+0x2c/0x88
[ 3581.192065]  [<ffffffff812f5556>] xfs_fs_sync_fs+0x21/0x48
[ 3581.192419]  [<ffffffff811611e1>] __sync_filesystem+0x66/0x7a
[ 3581.192783]  [<ffffffff8116120b>] sync_one_sb+0x16/0x18
[ 3581.193128]  [<ffffffff8113e3e3>] iterate_supers+0x72/0xce
[ 3581.193482]  [<ffffffff81161140>] sync_filesystems+0x20/0x22
[ 3581.193842]  [<ffffffff8116127e>] sys_sync+0x21/0x33
[ 3581.194177]  [<ffffffff819016c2>] system_call_fastpath+0x16/0x1b

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:23.000000000 +0800
@@ -432,6 +432,15 @@ writeback_single_inode(struct inode *ino
 				requeue_io(inode);
 			} else {
 				/*
+				 * sync livelock prevention: each inode is
+				 * tagged and synced in one shot. If still
+				 * dirty, move it back to s_dirty with updated
+				 * dirty time to prevent syncing it again.
+				 */
+				if (wbc->sync_mode == WB_SYNC_ALL ||
+				    wbc->tagged_sync)
+					inode->dirtied_when = jiffies;
+				/*
 				 * Writeback blocked by something other than
 				 * congestion. Delay the inode for some time to
 				 * avoid spinning on the CPU (100% iowait)




* [PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, linux-fsdevel

[-- Attachment #1: writeback-inodes_written.patch --]
[-- Type: text/plain, Size: 2319 bytes --]

The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happens to be only metadata-dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but is also misinterpreted as "no progress".

So introduce writeback_control.inodes_cleaned to count the inodes that
get cleaned.  A non-zero value means some progress was made on writeback,
in which case more writeback can be tried.
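
A sketch of how the counter feeds the retry decision in wb_writeback()
(paraphrased; the batch of 50 timestamp-dirtied inodes is hypothetical):

	wbc.nr_to_write = write_chunk;
	wbc.inodes_cleaned = 0;
	writeback_inodes_wb(wb, &wbc);
	/* e.g. 50 touch(1)-dirtied inodes: no dirty pages, so
	 * nr_to_write stays at write_chunk, but inodes_cleaned = 50 */
	if (write_chunk - wbc.nr_to_write > 0 || wbc.inodes_cleaned > 0)
		/* some progress was made: try another round */;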

About v1: the initial version counted successful ->write_inode()
calls.  However, it led to busy loops for sync() over NFS, because NFS
ridiculously returns 0 (success) while at the same time redirtying the
inode.  The NFS case can be trivially fixed; however, there may be more
hidden bugs in other filesystems..

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    4 ++++
 include/linux/writeback.h |    1 +
 2 files changed, 5 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:23.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:24.000000000 +0800
@@ -464,6 +464,7 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
+			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -725,6 +726,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
+		wbc.inodes_cleaned = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -741,6 +743,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write <= 0)
 			continue;
+		if (wbc.inodes_cleaned)
+			continue;
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
--- linux-next.orig/include/linux/writeback.h	2011-05-05 23:30:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-05 23:30:24.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is




* [PATCH 04/17] writeback: try more writeback as long as something was written
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: writeback-background-retry.patch --]
[-- Type: text/plain, Size: 3199 bytes --]

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive, in that
they may populate only a subset of the eligible inodes into b_io at
entrance time. When the queued set of inodes has all been synced, they
just return, possibly with all queued inode pages written but
wbc.nr_to_write still > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes is completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, nothing
more will be enqueued in queue_io(), hence nothing could/will be synced
and we may safely bail.

For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

At queue_io() time, i90-i99 happen to be expired and moved to b_io for
IO. When they finish successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() (w/o
this patch) will quit the background work while the bdi is still over the
background threshold. This will be a fairly normal/frequent case, I guess.
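
To put hypothetical numbers on it (MAX_WRITEBACK_PAGES is 1024): if i90-i99
carry 50 dirty pages each, the batch writes 500 pages and finishes with
nr_to_write = 1024 - 500 = 524 > 0, so the old "consumed everything" test
(wbc.nr_to_write <= 0) never triggers a retry and the work stops early,
even though the bdi is still over the background threshold.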

Jan raised the concern

	I'm just afraid that in some pathological cases this could
	result in bad writeback pattern - like if there is some process
	which manages to dirty just a few pages while we are doing
	writeout, this looping could result in writing just a few pages
	in each round which is bad for fragmentation etc.

However, it requires really precise timing to make that happen
(continuously).  In practice it's very hard to produce such a pattern,
even if it is possible in theory. I actually tried to write 1 page per
1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

while running sync(1) at the same time. The sync completes quickly on
ext4, xfs and btrfs. Readers could try other write-and-sleep patterns and
check whether they can block sync for a longer time.
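
For readers who want to experiment, a stand-in for the write-and-fsync tool
might look like the sketch below (the tool itself is not included here;
reading -n/-S/-c as iteration count, sleep in microseconds and chunk size
is my assumption):

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];			/* -c 4096: one page */
		int i, fd = open("/fs/test", O_WRONLY|O_CREAT|O_TRUNC, 0644);

		if (fd < 0)
			return 1;
		memset(buf, 'x', sizeof(buf));
		for (i = 0; i < 10000; i++) {	/* -n10000 */
			write(fd, buf, sizeof(buf));
			fsync(fd);
			usleep(1000);		/* -S 1000: 1 page per ~1ms */
		}
		close(fd);
		return 0;
	}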

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:24.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:25.000000000 +0800
@@ -739,23 +739,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * Dirty inodes are moved to b_io for writeback in batches.
+		 * The completion of the current batch does not necessarily
+		 * mean the overall work is done. So we keep looping as long
+		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)
 			break;
 		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.




* [PATCH 05/17] writeback: the kupdate expire timestamp should be a moving target
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Dave Chinner, Jan Kara, Mel Gorman,
	Itaru Kitayama, Christoph Hellwig, linux-fsdevel

[-- Attachment #1: writeback-moving-dirty-expired.patch --]
[-- Type: text/plain, Size: 2063 bytes --]

Dynamically compute the dirty expire timestamp at queue_io() time.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  letting the expired dirty pages drift to the end of the LRU lists,
  triggering the evil pageout(). Nevertheless this patch merely addresses
  part of the problem.
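
A sketch of the difference (with the default dirty_expire_interval of 3000
centisecs, i.e. 30 seconds; the *10 below converts centisecs to msecs):

	oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);

	/* before: computed once at wb_writeback() entry, so if the work
	 * runs for minutes, inodes expiring meanwhile are never queued;
	 * after:  recomputed in each loop iteration, so the expiry
	 * cutoff moves forward with the clock */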

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:25.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:25.000000000 +0800
@@ -672,11 +672,6 @@ static long wb_writeback(struct bdi_writ
 	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -723,6 +718,12 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		if (work->for_kupdate) {
+			oldest_jif = jiffies -
+				msecs_to_jiffies(dirty_expire_interval * 10);
+			wbc.older_than_this = &oldest_jif;
+		}
+
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;




* [PATCH 06/17] writeback: sync expired inodes first in background writeback
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Dave Chinner, Jan Kara, Rik van Riel,
	Mel Gorman, Christoph Hellwig, linux-fsdevel

[-- Attachment #1: writeback-moving-dirty-expired-background.patch --]
[-- Type: text/plain, Size: 3280 bytes --]

A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

At each queue_io() time, first try enqueuing only newly expired inodes.
If there are zero expired inodes to work with, then relax the rule and
enqueue all dirty inodes.
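
In sketch form (paraphrased; in the patch the fallback is actually driven
from wb_writeback() via a retry label, after the expired pass ran dry):

	/* pass 1: only inodes expired per older_than_this */
	wbc.older_than_this = &oldest_jif;
	queue_io(wb, wbc.older_than_this);

	/* pass 2, only if pass 1 left nothing to work on:
	 * drop the age filter and take all dirty inodes */
	if (list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
		wbc.older_than_this = NULL;
		queue_io(wb, wbc.older_than_this);
	}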

This will help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So syncing
older inodes first helps reduce the number of dirty pages reached by
the page reclaim code.

More background: as Mel put it, "it makes sense to write old pages first
to reduce the chances page reclaim is initiating IO."

Rik also presented the situation with a graph:

LRU head                                 [*] dirty page
[                          *              *      * *  *  * * * * * *]

Ideally, most dirty pages should lie close to the LRU tail rather than
the LRU head. That requires the flusher thread to sync old/expired inodes
first (as there are obvious correlations between inode age and page
age), and to give fair opportunities to newly expired inodes rather
than sticking with some large, eldest inodes (as larger inodes have
weaker inode<=>page age correlations).

This patch helps the flusher to meet both the above requirements.

Side effects: it might reduce the batch size and hence reduce
inode_wb_list_lock hold time, but in turn makes the cluster-by-partition
logic in the same function less effective at reducing disk seeks.

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:25.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:26.000000000 +0800
@@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
-		if (work->for_kupdate) {
+		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
 			wbc.older_than_this = &oldest_jif;
@@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.pages_skipped = 0;
 		wbc.inodes_cleaned = 0;
 
+retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
 			__writeback_inodes_sb(work->sb, wb, &wbc);
@@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
+		 * background writeback will start with expired inodes, and
+		 * if none is found, fallback to all inodes. This order helps
+		 * reduce the number of dirty pages reaching the end of LRU
+		 * lists and cause trouble to the page reclaim.
+		 */
+		if (work->for_background &&
+		    wbc.older_than_this &&
+		    list_empty(&wb->b_io) &&
+		    list_empty(&wb->b_more_io)) {
+			wbc.older_than_this = NULL;
+			goto retry;
+		}
+		/*
 		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)




* [PATCH 07/17] writeback: refill b_io iff empty
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, linux-fsdevel

[-- Attachment #1: writeback-refill-queue-iff-empty.patch --]
[-- Type: text/plain, Size: 7353 bytes --]

There is no point in carrying different refill policies for for_kupdate
and other types of work. Use a consistent "refill b_io iff empty" policy,
which can guarantee fairness in an easy-to-understand way.

A b_io refill will set up a _fixed_ work set with all currently eligible
inodes and start a new round of walking through b_io. The "fixed" work set
means that no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done will the inodes that are
eligible at that time be enqueued and the walk started over.

This procedure provides fairness among the inodes because it guarantees
that each inode is synced once and only once in each round. So no inode
will suffer starvation.

This change relies on wb_writeback() to keep retrying as long as some
progress was made on cleaning pages and/or inodes. Without that ability,
the old logic for background work had to rely on aggressively queuing all
eligible inodes into b_io every time, and that offered no real guarantee.
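
The resulting invariant, in sketch form (locking, superblock pinning and
error handling omitted):

	if (list_empty(&wb->b_io))
		queue_io(wb, wbc->older_than_this);	/* open a new round */

	while (!list_empty(&wb->b_io)) {
		struct inode *inode = wb_inode(wb->b_io.prev);
		/* each queued inode gets exactly one shot per round;
		 * nothing new joins b_io until this walk completes */
		writeback_single_inode(inode, wbc);
	}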

The test script below completes slightly faster now:

             2.6.39-rc3	  2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043      252.367
stddev           24.381       12.530

tar elapsed      30.097       28.808
dd  elapsed      13.214       11.782

	#!/bin/zsh

	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

	umount /dev/sda7
	mkfs.xfs -f /dev/sda7
	mount /dev/sda7 /fs

	echo 3 > /proc/sys/vm/drop_caches

	tic=$(cat /proc/uptime|cut -d' ' -f2)

	cd /fs
	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

	wait
	sync
	tac=$(cat /proc/uptime|cut -d' ' -f2)
	echo elapsed: $((tac - tic))

It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.

Analysis from Dave Chinner, in great detail:

Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)

	writeback  8 small inodes 800 pages
		   1 large inode 224 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)
	.....

Your new code:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 8s)
	writeback  8 small inodes 800 pages

	b_io empty: (1800 pages written)
		b_more_io (large inode) -> b_io (1l)
		14 newly expired inodes -> b_io (1l, 14s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 14s)
	writeback  10 small inodes 1000 pages
		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
	writeback  5 small inodes 500 pages
	b_io empty: (2548 pages written)
		b_more_io (large inode) -> b_io (1l, 1s(24))
		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
	......

Rough progression of pages written at b_io refill:

Old code:

	total	large file	% of writeback
	1024	224		21.9% (fixed)

New code:
	total	large file	% of writeback
	1800	1024		~55%
	2550	1024		~40%
	3050	1024		~33%
	3500	1024		~29%
	3950	1024		~26%
	4250	1024		~24%
	4500	1024		~22.7%
	4700	1024		~21.7%
	4800	1024		~21.3%
	4800	1024		~21.3%
	(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

after + dyn-expire + ioless:
                &(&wb->list_lock)->rlock:          2291           2304           0.15         204.09        3125.12          35315         970712           0.10         223.84     1113437.05
                ------------------------
                &(&wb->list_lock)->rlock              9          [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock           1614          [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock            459          [<ffffffff8115d351>] writeback_sb_inodes+0x108/0x154
                &(&wb->list_lock)->rlock            137          [<ffffffff8115cdcf>] writeback_single_inode+0x1b4/0x296
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110c367>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock           1160          [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock            435          [<ffffffff8115dcb6>] writeback_inodes_wb+0x33/0x12b

after + dyn-expire:
                &(&wb->list_lock)->rlock:        226820         229719           0.10         194.28      809275.91         327372     1033513685           0.08         476.96  3590929811.61
                ------------------------
                &(&wb->list_lock)->rlock             11          [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock          30559          [<ffffffff8115bb1f>] wb_writeback+0x2fb/0x3c3
                &(&wb->list_lock)->rlock          37339          [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
                &(&wb->list_lock)->rlock          54880          [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b606>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock          55347          [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
                &(&wb->list_lock)->rlock          55338          [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:26.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:27.000000000 +0800
@@ -589,7 +589,8 @@ void writeback_inodes_wb(struct bdi_writ
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
@@ -616,7 +617,7 @@ static void __writeback_inodes_sb(struct
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_wb_list_lock);




* [PATCH 08/17] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
From: Wu Fengguang @ 2011-05-06  3:08 UTC
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Christoph Hellwig, Hugh Dickins, Jan Kara,
	Dave Chinner, Christoph Hellwig, linux-fsdevel

[-- Attachment #1: writeback-split-lock.patch --]
[-- Type: text/plain, Size: 21352 bytes --]

From: Christoph Hellwig <hch@infradead.org>

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata-heavy
workloads.  It won't help for single-filesystem workloads, for which we'll
need the I/O-less balance_dirty_pages(), but at least we can dedicate a
CPU to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.
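
The lock-ordering idiom it introduces, sketched in isolation (a hypothetical
helper mirroring bdi_lock_two(); ordering by address means every path takes
the two locks in the same order, ruling out ABBA deadlocks):

	static void lock_pair_ordered(spinlock_t *a, spinlock_t *b)
	{
		if (a < b) {
			spin_lock(a);
			spin_lock_nested(b, 1);	/* nested subclass for lockdep */
		} else {
			spin_lock(b);
			spin_lock_nested(a, 1);
		}
	}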

It reduces lock contention to 1/4 in this test case:
10-HDD JBOD, 100 dd tasks on each disk, XFS, 6 GB RAM

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/block_dev.c              |   16 +++--
 fs/fs-writeback.c           |   97 +++++++++++++++++-----------------
 fs/inode.c                  |    5 -
 include/linux/backing-dev.h |    2 
 include/linux/writeback.h   |    2 
 mm/backing-dev.c            |   21 +++++--
 mm/filemap.c                |    6 +-
 mm/rmap.c                   |    4 -
 8 files changed, 85 insertions(+), 68 deletions(-)

--- linux-next.orig/fs/block_dev.c	2011-05-05 23:29:50.000000000 +0800
+++ linux-next/fs/block_dev.c	2011-05-05 23:44:36.000000000 +0800
@@ -44,24 +44,28 @@ inline struct block_device *I_BDEV(struc
 {
 	return &BDEV_I(inode)->bdev;
 }
-
 EXPORT_SYMBOL(I_BDEV);
 
 /*
- * move the inode from it's current bdi to the a new bdi. if the inode is dirty
- * we need to move it onto the dirty list of @dst so that the inode is always
- * on the right list.
+ * Move the inode from its current bdi to a new bdi. If the inode is dirty we
+ * need to move it onto the dirty list of @dst so that the inode is always on
+ * the right list.
  */
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
+	if (dst == old)			/* deadlock avoidance */
+		return;
+	bdi_lock_two(&old->wb, &dst->wb);
 	spin_lock(&inode->i_lock);
 	inode->i_data.backing_dev_info = dst;
 	if (inode->i_state & I_DIRTY)
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&old->wb.list_lock);
+	spin_unlock(&dst->wb.list_lock);
 }
 
 static sector_t max_block(struct block_device *bdev)
--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:41:36.000000000 +0800
@@ -181,12 +181,13 @@ void bdi_start_background_writeback(stru
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
 	list_del_init(&inode->i_wb_list);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 }
 
-
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -196,11 +197,9 @@ void inode_wb_list_del(struct inode *ino
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -214,11 +213,9 @@ static void redirty_tail(struct inode *i
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
-static void requeue_io(struct inode *inode)
+static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
@@ -226,7 +223,7 @@ static void inode_sync_complete(struct i
 {
 	/*
 	 * Prevent speculative execution through
-	 * spin_unlock(&inode_wb_list_lock);
+	 * spin_unlock(&wb->list_lock);
 	 */
 
 	smp_mb();
@@ -302,7 +299,7 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -317,7 +314,8 @@ static int write_inode(struct inode *ino
 /*
  * Wait for writeback on an inode to complete.
  */
-static void inode_wait_for_writeback(struct inode *inode)
+static void inode_wait_for_writeback(struct inode *inode,
+				     struct bdi_writeback *wb)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -325,15 +323,15 @@ static void inode_wait_for_writeback(str
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_wb_list_lock and
+ * Write out an inode's dirty pages.  Called under wb->list_lock and
  * inode->i_lock.  Either the caller has an active reference on the inode or
  * the inode has I_WILL_FREE set.
  *
@@ -344,13 +342,14 @@ static void inode_wait_for_writeback(str
  * livelocks, etc.
  */
 static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+		       struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
 
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
 
 	if (!atomic_read(&inode->i_count))
@@ -368,14 +367,14 @@ writeback_single_inode(struct inode *ino
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			return 0;
 		}
 
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
-		inode_wait_for_writeback(inode);
+		inode_wait_for_writeback(inode, wb);
 	}
 
 	BUG_ON(inode->i_state & I_SYNC);
@@ -384,7 +383,7 @@ writeback_single_inode(struct inode *ino
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -415,7 +414,7 @@ writeback_single_inode(struct inode *ino
 			ret = err;
 	}
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -429,7 +428,7 @@ writeback_single_inode(struct inode *ino
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(inode);
+				requeue_io(inode, wb);
 			} else {
 				/*
 				 * sync livelock prevention: each inode is
@@ -447,7 +446,7 @@ writeback_single_inode(struct inode *ino
 				 * retrying writeback of the dirty page/inode
 				 * that cannot be performed immediately.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 			}
 		} else if (inode->i_state & I_DIRTY) {
 			/*
@@ -456,7 +455,7 @@ writeback_single_inode(struct inode *ino
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -521,7 +520,7 @@ static int writeback_sb_inodes(struct su
 				 * superblock, move all inodes not belonging
 				 * to it back onto the dirty list.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 				continue;
 			}
 
@@ -541,7 +540,7 @@ static int writeback_sb_inodes(struct su
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 
@@ -557,19 +556,19 @@ static int writeback_sb_inodes(struct su
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wbc);
+		writeback_single_inode(inode, wb, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -588,7 +587,7 @@ void writeback_inodes_wb(struct bdi_writ
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
@@ -598,7 +597,7 @@ void writeback_inodes_wb(struct bdi_writ
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 		ret = writeback_sb_inodes(sb, wb, wbc, false);
@@ -607,7 +606,7 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -616,11 +615,11 @@ static void __writeback_inodes_sb(struct
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 }
 
 /*
@@ -776,15 +775,15 @@ retry:
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode);
+			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 	}
 
 	return wrote;
@@ -1121,10 +1120,10 @@ void __mark_inode_dirty(struct inode *in
 			}
 
 			spin_unlock(&inode->i_lock);
-			spin_lock(&inode_wb_list_lock);
+			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
-			spin_unlock(&inode_wb_list_lock);
+			spin_unlock(&bdi->wb.list_lock);
 
 			if (wakeup_bdi)
 				bdi_wakeup_thread_delayed(bdi);
@@ -1326,6 +1325,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
@@ -1338,11 +1338,11 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, &wbc);
+	ret = writeback_single_inode(inode, wb, &wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -1362,13 +1362,14 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wbc);
+	ret = writeback_single_inode(inode, wb, wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
--- linux-next.orig/fs/inode.c	2011-05-05 23:29:50.000000000 +0800
+++ linux-next/fs/inode.c	2011-05-05 23:30:28.000000000 +0800
@@ -37,7 +37,7 @@
  *   inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
- * inode_wb_list_lock protects:
+ * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
@@ -48,7 +48,7 @@
  *   inode->i_lock
  *     inode_lru_lock
  *
- * inode_wb_list_lock
+ * bdi->wb.list_lock
  *   inode->i_lock
  *
  * inode_hash_lock
@@ -111,7 +111,6 @@ static LIST_HEAD(inode_lru);
 static DEFINE_SPINLOCK(inode_lru_lock);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);
 
 /*
  * iprune_sem provides exclusion between the icache shrinking and the
--- linux-next.orig/include/linux/backing-dev.h	2011-05-05 23:29:50.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-05-05 23:30:28.000000000 +0800
@@ -57,6 +57,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
 struct backing_dev_info {
@@ -106,6 +107,7 @@ int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux-next.orig/include/linux/writeback.h	2011-05-05 23:30:24.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-05 23:41:36.000000000 +0800
@@ -9,8 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_wb_list_lock;
-
 /*
  * fs/fs-writeback.c
  */
--- linux-next.orig/mm/backing-dev.c	2011-05-05 23:29:50.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-05-05 23:41:35.000000000 +0800
@@ -45,6 +45,17 @@ static struct timer_list sync_supers_tim
 static int bdi_sync_supers(void *);
 static void sync_supers_timer_fn(unsigned long);
 
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
+{
+	if (wb1 < wb2) {
+		spin_lock(&wb1->list_lock);
+		spin_lock_nested(&wb2->list_lock, 1);
+	} else {
+		spin_lock(&wb2->list_lock);
+		spin_lock_nested(&wb1->list_lock, 1);
+	}
+}
+
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
@@ -67,14 +78,14 @@ static int bdi_debug_stats_show(struct s
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -628,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -676,11 +688,12 @@ void bdi_destroy(struct backing_dev_info
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&inode_wb_list_lock);
+		bdi_lock_two(&bdi->wb, dst);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
+		spin_unlock(&dst->list_lock);
 	}
 
 	bdi_unregister(bdi);
--- linux-next.orig/mm/filemap.c	2011-05-05 23:29:50.000000000 +0800
+++ linux-next/mm/filemap.c	2011-05-05 23:30:28.000000000 +0800
@@ -80,7 +80,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  inode_wb_list_lock
+ *  bdi->wb.list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -98,9 +98,9 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(page_remove_rmap->set_page_dirty)
+ *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(zap_pte_range->set_page_dirty)
+ *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
--- linux-next.orig/mm/rmap.c	2011-05-05 23:29:50.000000000 +0800
+++ linux-next/mm/rmap.c	2011-05-05 23:30:28.000000000 +0800
@@ -32,11 +32,11 @@
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
  *               inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *               inode_wb_list_lock (in set_page_dirty's __mark_inode_dirty)
+ *               bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                 sb_lock (within inode_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_wb_list_lock in __sync_single_inode)
+ *                           within bdi.wb->list_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock
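
A note for readers following the locking change: the bdi_lock_two()
helper introduced above avoids ABBA deadlocks by always taking the
lower-addressed lock first. A minimal sketch of the idea (illustrative,
not the kernel code itself):

	/*
	 * Address-ordered locking: every task that needs both locks
	 * acquires them in the same global (address) order, so two tasks
	 * locking the same pair can never deadlock.  spin_lock_nested()
	 * merely tells lockdep that nesting two locks of the same class
	 * is intentional.
	 */
	static void lock_pair(spinlock_t *a, spinlock_t *b)
	{
		if (a < b) {
			spin_lock(a);
			spin_lock_nested(b, SINGLE_DEPTH_NESTING);
		} else {
			spin_lock(b);
			spin_lock_nested(a, SINGLE_DEPTH_NESTING);
		}
	}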



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 09/17] writeback: elevate queue_io() into wb_writeback()
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (7 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 08/17] writeback: split inode_wb_list_lock into bdi_writeback.list_lock Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-09 16:15   ` Jan Kara
  2011-05-06  3:08 ` [PATCH 10/17] writeback: avoid extra sync work at enqueue time Wu Fengguang
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: move-queue_io.patch --]
[-- Type: text/plain, Size: 2727 bytes --]

Code refactor for more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
  before calling __writeback_inodes_wb() (see the sketch below)
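
A sketch of the resulting flow in wb_writeback(), mirroring the hunks
below (illustrative, not the complete function):

	spin_lock(&wb->list_lock);
	if (list_empty(&wb->b_io))
		queue_io(wb, wbc.older_than_this);	/* refill b_io iff empty */
	if (work->sb)
		writeback_sb_inodes(work->sb, wb, &wbc, true);
	else
		__writeback_inodes_wb(wb, &wbc);
	spin_unlock(&wb->list_lock);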

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:41:36.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:51:59.000000000 +0800
@@ -580,17 +580,13 @@ static int writeback_sb_inodes(struct su
 	return 1;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+static void __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct writeback_control *wbc)
 {
 	int ret = 0;
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&wb->list_lock);
-
-	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -606,19 +602,16 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
-static void __writeback_inodes_sb(struct super_block *sb,
-		struct bdi_writeback *wb, struct writeback_control *wbc)
+void writeback_inodes_wb(struct bdi_writeback *wb,
+		struct writeback_control *wbc)
 {
-	WARN_ON(!rwsem_is_locked(&sb->s_umount));
-
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
-	writeback_sb_inodes(sb, wb, wbc, true);
+	__writeback_inodes_wb(wb, wbc);
 	spin_unlock(&wb->list_lock);
 }
 
@@ -685,7 +678,7 @@ static long wb_writeback(struct bdi_writ
 	 * The intended call sequence for WB_SYNC_ALL writeback is:
 	 *
 	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
+	 *          writeback_sb_inodes()       <== called only once
 	 *              write_cache_pages()     <== called once for each inode
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
@@ -731,10 +724,14 @@ static long wb_writeback(struct bdi_writ
 
 retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
+		spin_lock(&wb->list_lock);
+		if (list_empty(&wb->b_io))
+			queue_io(wb, wbc.older_than_this);
 		if (work->sb)
-			__writeback_inodes_sb(work->sb, wb, &wbc);
+			writeback_sb_inodes(work->sb, wb, &wbc, true);
 		else
-			writeback_inodes_wb(wb, &wbc);
+			__writeback_inodes_wb(wb, &wbc);
+		spin_unlock(&wb->list_lock);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 10/17] writeback: avoid extra sync work at enqueue time
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (8 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 09/17] writeback: elevate queue_io() into wb_writeback() Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-09 16:16   ` Jan Kara
  2011-05-06  3:08 ` [PATCH 11/17] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: writeback-kill-wb_start.patch --]
[-- Type: text/plain, Size: 2191 bytes --]

This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
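
A minimal sketch of how .older_than_this gates the enqueue, simplified
from move_expired_inodes() (inode_dirtied_after() is the existing
helper):

	while (!list_empty(delaying_queue)) {
		inode = wb_inode(delaying_queue->prev);
		if (older_than_this &&
		    inode_dirtied_after(inode, *older_than_this))
			break;	/* too new: stays on b_dirty for now */
		list_move(&inode->i_wb_list, &tmp);
	}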

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   16 +++-------------
 include/linux/writeback.h |    3 ---
 2 files changed, 3 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:29.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:30.000000000 +0800
@@ -544,15 +544,6 @@ static int writeback_sb_inodes(struct su
 			continue;
 		}
 
-		/*
-		 * Was this inode dirtied after sync_sb_inodes was called?
-		 * This keeps sync from extra jobs and livelock.
-		 */
-		if (inode_dirtied_after(inode, wbc->wb_start)) {
-			spin_unlock(&inode->i_lock);
-			return 1;
-		}
-
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
@@ -585,9 +576,6 @@ static void __writeback_inodes_wb(struct
 {
 	int ret = 0;
 
-	if (!wbc->wb_start)
-		wbc->wb_start = jiffies; /* livelock avoidance */
-
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
 		struct super_block *sb = inode->i_sb;
@@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
 	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
 		write_chunk = LONG_MAX;
 
-	wbc.wb_start = jiffies; /* livelock avoidance */
+	oldest_jif = jiffies;
+	wbc.older_than_this = &oldest_jif;
+
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
--- linux-next.orig/include/linux/writeback.h	2011-05-05 23:30:28.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-05 23:30:30.000000000 +0800
@@ -26,9 +26,6 @@ struct writeback_control {
 	enum writeback_sync_modes sync_mode;
 	unsigned long *older_than_this;	/* If !NULL, only write back inodes
 					   older than this */
-	unsigned long wb_start;         /* Time writeback_inodes_wb was
-					   called. This is needed to avoid
-					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 11/17] writeback: add bdi_dirty_limit() kernel-doc
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (9 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 10/17] writeback: avoid extra sync work at enqueue time Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-06  3:08 ` [PATCH 12/17] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Peter Zijlstra, Jan Kara, Dave Chinner,
	Christoph Hellwig, linux-fsdevel

[-- Attachment #1: writeback-task_dirty_limit-comment.patch --]
[-- Type: text/plain, Size: 1193 bytes --]

Clarify the bdi_dirty_limit() comment.
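
An illustrative calculation of the proportional split (the numbers are
made up; the real code uses prop_fraction_percpu() and fixed-point
math):

	/*
	 * A bdi's share scales with its fraction of recently completed
	 * writeout: a fast device that completed 60% of recent writeout
	 * gets ~60% of the global limit.
	 */
	unsigned long dirty = 1000;		/* global dirty limit, in pages */
	long numerator = 60, denominator = 100;	/* bdi's writeout fraction */
	unsigned long bdi_thresh = dirty * numerator / denominator;	/* 600 */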

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-05 23:30:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-05 23:30:31.000000000 +0800
@@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-/*
+/**
  * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ *
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * And the "limit" in the name is not seriously taken as hard limit in
+ * balance_dirty_pages().
  *
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
  * - piling up dirty pages (that will take long time to sync) on slow devices
  *



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 12/17] writeback: skip balance_dirty_pages() for in-memory fs
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (10 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 11/17] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-06  3:08 ` [PATCH 13/17] writeback: remove writeback_control.more_io Wu Fengguang
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel

[-- Attachment #1: writeback-trace-global-dirty-states-fix.patch --]
[-- Type: text/plain, Size: 2966 bytes --]

This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

It can also prevent

[  388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050

in the balance_dirty_pages tracepoint, which will call

	dev_name(mapping->backing_dev_info->dev)

but shmem_backing_dev_info.dev is NULL.

Summary notes about the tmpfs/ramfs behavior changes:

In 2.6.36 and older kernels, tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold.  This is because both the bdi dirty page
count and the bdi threshold are 0 for tmpfs/ramfs, so the first
comparison below reads 0 >= 0 and the test always evaluates to TRUE:

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                        || (nr_reclaimable + nr_writeback >= dirty_thresh);

For 2.6.37, someone complained that the current logic does not allow
users to set vm.dirty_ratio=0.  So commit 4cbec4c8b9 changed the test to

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                        || (nr_reclaimable + nr_writeback > dirty_thresh);

So 2.6.37 behaves differently for tmpfs/ramfs: they will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen, but once it does, it will block many tasks).

I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
that on a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly help any workload because of its
"either no throttling, or throttled to death" property.

So relative to 2.6.37, this patch won't bring further noticeable changes.
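
Why the new bdi_cap_account_dirty() guard short-circuits for
tmpfs/ramfs — the relevant definitions of that era, slightly simplified
(from include/linux/backing-dev.h and mm/shmem.c):

	/* shmem's bdi opts out of dirty accounting entirely ... */
	static struct backing_dev_info shmem_backing_dev_info = {
		.ra_pages	= 0,	/* no readahead */
		.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
	};

	/* ... so this test is false and balance_dirty_pages() returns early */
	static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
	{
		return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
	}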

CC: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-05 23:30:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-05 23:30:32.000000000 +0800
@@ -244,13 +244,8 @@ void task_dirty_inc(struct task_struct *
 static void bdi_writeout_fraction(struct backing_dev_info *bdi,
 		long *numerator, long *denominator)
 {
-	if (bdi_cap_writeback_dirty(bdi)) {
-		prop_fraction_percpu(&vm_completions, &bdi->completions,
+	prop_fraction_percpu(&vm_completions, &bdi->completions,
 				numerator, denominator);
-	} else {
-		*numerator = 0;
-		*denominator = 1;
-	}
 }
 
 static inline void task_dirties_fraction(struct task_struct *tsk,
@@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (!bdi_cap_account_dirty(bdi))
+		return;
+
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 13/17] writeback: remove writeback_control.more_io
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (11 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 12/17] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-06  3:08 ` [PATCH 14/17] writeback: make writeback_control.nr_to_write straight Wu Fengguang
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Mel Gorman, Minchan Kim,
	Dave Chinner, Christoph Hellwig, linux-fsdevel

[-- Attachment #1: writeback-kill-more_io.patch --]
[-- Type: text/plain, Size: 4417 bytes --]

When wbc.more_io was first introduced, it indicated whether there was
at least one superblock whose s_more_io list contained more IO work.
Now with per-bdi writeback, it can be replaced with a simple b_more_io
test.
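
The equivalent check after this patch, as the wb_writeback() hunk below
shows, is simply:

	if (list_empty(&wb->b_more_io))
		break;	/* no more inodes for IO, bail */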

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    9 ++-------
 include/linux/writeback.h        |    1 -
 include/trace/events/ext4.h      |    6 ++----
 include/trace/events/writeback.h |    5 +----
 4 files changed, 5 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:30.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:33.000000000 +0800
@@ -560,12 +560,8 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			return 1;
-		}
-		if (!list_empty(&wb->b_more_io))
-			wbc->more_io = 1;
 	}
 	/* b_io is empty */
 	return 1;
@@ -707,7 +703,6 @@ static long wb_writeback(struct bdi_writ
 			wbc.older_than_this = &oldest_jif;
 		}
 
-		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 		wbc.inodes_cleaned = 0;
@@ -755,7 +750,7 @@ retry:
 		/*
 		 * No more inodes for IO, bail
 		 */
-		if (!wbc.more_io)
+		if (list_empty(&wb->b_more_io))
 			break;
 		/*
 		 * Nothing written. Wait for some inode to
--- linux-next.orig/include/linux/writeback.h	2011-05-05 23:30:30.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-05 23:30:33.000000000 +0800
@@ -46,7 +46,6 @@ struct writeback_control {
 	unsigned tagged_sync:1;		/* do livelock prevention for sync */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
-	unsigned more_io:1;		/* more io to be dispatched */
 };
 
 /*
--- linux-next.orig/include/trace/events/writeback.h	2011-05-05 23:29:49.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-05 23:30:33.000000000 +0800
@@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(int, more_io)
 		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
@@ -116,7 +115,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->more_io	= wbc->more_io;
 		__entry->older_than_this = wbc->older_than_this ?
 						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
@@ -124,7 +122,7 @@ DECLARE_EVENT_CLASS(wbc_class,
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -134,7 +132,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->more_io,
 		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
--- linux-next.orig/include/trace/events/ext4.h	2011-05-05 23:29:50.000000000 +0800
+++ linux-next/include/trace/events/ext4.h	2011-05-05 23:30:33.000000000 +0800
@@ -404,7 +404,6 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__field(	int,	pages_written		)
 		__field(	long,	pages_skipped		)
 		__field(	int,	sync_mode		)
-		__field(	char,	more_io			)	
 		__field(       pgoff_t,	writeback_index		)
 	),
 
@@ -415,16 +414,15 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__entry->pages_written	= pages_written;
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->sync_mode	= wbc->sync_mode;
-		__entry->more_io	= wbc->more_io;
 		__entry->writeback_index = inode->i_mapping->writeback_index;
 	),
 
 	TP_printk("dev %d,%d ino %lu ret %d pages_written %d pages_skipped %ld "
-		  " more_io %d sync_mode %d writeback_index %lu",
+		  "sync_mode %d writeback_index %lu",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino, __entry->ret,
 		  __entry->pages_written, __entry->pages_skipped,
-		  __entry->more_io, __entry->sync_mode,
+		  __entry->sync_mode,
 		  (unsigned long) __entry->writeback_index)
 );
 



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 14/17] writeback: make writeback_control.nr_to_write straight
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (12 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 13/17] writeback: remove writeback_control.more_io Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-09 16:54   ` Jan Kara
  2011-05-06  3:08 ` [PATCH 15/17] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: writeback-pass-work-to-writeback_sb_inodes.patch --]
[-- Type: text/plain, Size: 14527 bytes --]

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately cleans things up: e.g. suddenly wbc.nr_to_write vs.
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes() it can always start from a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
quota remains in wbc->nr_to_write.
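
For reference, the chunk arithmetic, assuming the common 4KB page size:

	/*
	 * Illustrative only: MAX_WRITEBACK_PAGES is 1024 pages, i.e. the
	 * 4MB chunk mentioned above with 4KB pages.  WB_SYNC_ALL and
	 * tagged_sync writeback are exempted and use LONG_MAX instead, as
	 * the new writeback_chunk_size() in the diff below shows.
	 */
	#define MAX_WRITEBACK_PAGES	1024L	/* 1024 * 4KB = 4MB */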

Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 
 fs/fs-writeback.c                |  184 +++++++++++++++--------------
 include/linux/writeback.h        |    3 
 include/trace/events/writeback.h |    6 
 mm/backing-dev.c                 |    1 
 mm/page-writeback.c              |    1 
 6 files changed, 99 insertions(+), 98 deletions(-)

change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
  to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_cleaned
- remove wbc_writeback_* trace events for now

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:52:19.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:53:26.000000000 +0800
@@ -30,11 +30,21 @@
 #include "internal.h"
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024L
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_sync:1;
 	unsigned int for_kupdate:1;
@@ -463,7 +473,6 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
-			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -496,6 +505,31 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+	long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_sb_inodes()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
+		pages = LONG_MAX;
+	else
+		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+	return pages;
+}
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -503,18 +537,29 @@ static bool pin_sb_for_writeback(struct 
  * inodes. Otherwise write only ones which go sequentially
  * in reverse order.
  *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes cleaned.
  */
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
-		struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
 {
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_sync		= work->tagged_sync,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	long write_chunk;
+	long wrote = 0;  /* count both pages and inodes */
+
 	while (!list_empty(&wb->b_io)) {
-		long pages_skipped;
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
 		if (inode->i_sb != sb) {
-			if (only_this_sb) {
+			if (work->sb) {
 				/*
 				 * We only want to write back data for this
 				 * superblock, move all inodes not belonging
@@ -529,7 +574,7 @@ static int writeback_sb_inodes(struct su
 			 * Bounce back to the caller to unpin this and
 			 * pin the next superblock.
 			 */
-			return 0;
+			break;
 		}
 
 		/*
@@ -543,34 +588,40 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode, wb);
 			continue;
 		}
-
 		__iget(inode);
+		write_chunk = writeback_chunk_size(work);
+		wbc.nr_to_write = write_chunk;
+		wbc.pages_skipped = 0;
+
+		writeback_single_inode(inode, wb, &wbc);
 
-		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wb, wbc);
-		if (wbc->pages_skipped != pages_skipped) {
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
+		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
 			redirty_tail(inode, wb);
-		}
+		} else if (!(inode->i_state & I_DIRTY))
+			wrote++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0)
-			return 1;
+		if (wrote >= MAX_WRITEBACK_PAGES)
+			break;
+		if (work->nr_pages <= 0)
+			break;
 	}
-	/* b_io is empty */
-	return 1;
+	return wrote;
 }
 
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
-				  struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
-	int ret = 0;
+	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -580,33 +631,35 @@ static void __writeback_inodes_wb(struct
 			requeue_io(inode, wb);
 			continue;
 		}
-		ret = writeback_sb_inodes(sb, wb, wbc, false);
+		wrote += writeback_sb_inodes(sb, wb, work);
 		drop_super(sb);
 
-		if (ret)
+		if (wrote >= MAX_WRITEBACK_PAGES)
+			break;
+		if (work->nr_pages <= 0)
 			break;
 	}
 	/* Leave any unwritten inodes on b_io */
+	return wrote;
 }
 
 void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+			 struct writeback_control *wbc)
 {
+	struct wb_writeback_work work = {
+		.nr_pages	= wbc->nr_to_write,
+		.sync_mode	= wbc->sync_mode,
+		.range_cyclic	= wbc->range_cyclic,
+	};
+
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
-	__writeback_inodes_wb(wb, wbc);
+		queue_io(wb, NULL);
+	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
-}
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+	wbc->nr_to_write = work.nr_pages;
+}
 
 static inline bool over_bground_thresh(void)
 {
@@ -636,42 +689,13 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_sync		= work->tagged_sync,
-		.older_than_this	= NULL,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.range_cyclic		= work->range_cyclic,
-	};
+	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	long wrote = 0;
-	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
-
-	if (!wbc.range_cyclic) {
-		wbc.range_start = 0;
-		wbc.range_end = LLONG_MAX;
-	}
-
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          writeback_sb_inodes()       <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
-		write_chunk = LONG_MAX;
+	long progress;
 
 	oldest_jif = jiffies;
-	wbc.older_than_this = &oldest_jif;
+	work->older_than_this = &oldest_jif;
 
 	for (;;) {
 		/*
@@ -700,27 +724,18 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			wbc.older_than_this = &oldest_jif;
+			work->older_than_this = &oldest_jif;
 		}
 
-		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
-		wbc.inodes_cleaned = 0;
-
 retry:
-		trace_wbc_writeback_start(&wbc, wb->bdi);
 		spin_lock(&wb->list_lock);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, wbc.older_than_this);
+			queue_io(wb, work->older_than_this);
 		if (work->sb)
-			writeback_sb_inodes(work->sb, wb, &wbc, true);
+			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-			__writeback_inodes_wb(wb, &wbc);
+			progress = __writeback_inodes_wb(wb, work);
 		spin_unlock(&wb->list_lock);
-		trace_wbc_writeback_written(&wbc, wb->bdi);
-
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
 		 * Did we write something? Try for more
@@ -730,9 +745,7 @@ retry:
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		if (wbc.inodes_cleaned)
+		if (progress)
 			continue;
 		/*
 		 * background writeback will start with expired inodes, and
@@ -741,10 +754,10 @@ retry:
 		 * lists and cause trouble to the page reclaim.
 		 */
 		if (work->for_background &&
-		    wbc.older_than_this &&
+		    work->older_than_this &&
 		    list_empty(&wb->b_io) &&
 		    list_empty(&wb->b_more_io)) {
-			wbc.older_than_this = NULL;
+			work->older_than_this = NULL;
 			goto retry;
 		}
 		/*
@@ -760,7 +773,6 @@ retry:
 		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
-			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
@@ -768,7 +780,7 @@ retry:
 		spin_unlock(&wb->list_lock);
 	}
 
-	return wrote;
+	return nr_pages - work->nr_pages;
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-05-05 23:52:19.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-05 23:53:26.000000000 +0800
@@ -24,12 +24,9 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
-	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
--- linux-next.orig/mm/backing-dev.c	2011-05-05 23:41:35.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-05-05 23:53:26.000000000 +0800
@@ -264,7 +264,6 @@ static void bdi_flush_io(struct backing_
 {
 	struct writeback_control wbc = {
 		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 		.nr_to_write		= 1024,
 	};
--- linux-next.orig/mm/page-writeback.c	2011-05-05 23:52:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-05 23:53:26.000000000 +0800
@@ -496,7 +496,6 @@ static void balance_dirty_pages(struct a
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
 			.range_cyclic	= 1,
 		};
--- linux-next.orig/include/trace/events/writeback.h	2011-05-05 23:52:19.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-05 23:53:26.000000000 +0800
@@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -132,7 +129,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/fs/btrfs/extent_io.c	2011-05-05 23:41:35.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-05-05 23:53:26.000000000 +0800
@@ -2587,7 +2587,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2620,7 +2619,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 15/17] writeback: remove .nonblocking and .encountered_congestion
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (13 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 14/17] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-06  3:08 ` [PATCH 16/17] writeback: trace event writeback_single_inode Wu Fengguang
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: writeback-kill-nonblocking.patch --]
[-- Type: text/plain, Size: 3579 bytes --]

Remove two unused struct writeback_control fields:

	.encountered_congestion	(completely unused)
	.nonblocking		(never set, checked/showed in XFS,NFS/btrfs)

The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.

Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c               |    3 +--
 fs/xfs/linux-2.6/xfs_aops.c  |    2 +-
 include/linux/writeback.h    |    2 --
 include/trace/events/btrfs.h |    6 ++----
 4 files changed, 4 insertions(+), 9 deletions(-)

--- linux-next.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-05-06 00:35:11.000000000 +0800
+++ linux-next/fs/xfs/linux-2.6/xfs_aops.c	2011-05-06 00:35:14.000000000 +0800
@@ -970,7 +970,7 @@ xfs_vm_writepage(
 	offset = page_offset(page);
 	type = IO_OVERWRITE;
 
-	if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
+	if (wbc->sync_mode == WB_SYNC_NONE)
 		nonblocking = 1;
 
 	do {
--- linux-next.orig/include/trace/events/btrfs.h	2011-05-06 00:35:11.000000000 +0800
+++ linux-next/include/trace/events/btrfs.h	2011-05-06 00:35:14.000000000 +0800
@@ -284,7 +284,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__field(	long,   pages_skipped		)
 		__field(	loff_t, range_start		)
 		__field(	loff_t, range_end		)
-		__field(	char,   nonblocking		)
 		__field(	char,   for_kupdate		)
 		__field(	char,   for_reclaim		)
 		__field(	char,   range_cyclic		)
@@ -299,7 +298,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->range_start	= wbc->range_start;
 		__entry->range_end	= wbc->range_end;
-		__entry->nonblocking	= wbc->nonblocking;
 		__entry->for_kupdate	= wbc->for_kupdate;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
@@ -310,13 +308,13 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 
 	TP_printk("root = %llu(%s), ino = %lu, page_index = %lu, "
 		  "nr_to_write = %ld, pages_skipped = %ld, range_start = %llu, "
-		  "range_end = %llu, nonblocking = %d, for_kupdate = %d, "
+		  "range_end = %llu, for_kupdate = %d, "
 		  "for_reclaim = %d, range_cyclic = %d, writeback_index = %lu",
 		  show_root_type(__entry->root_objectid),
 		  (unsigned long)__entry->ino, __entry->index,
 		  __entry->nr_to_write, __entry->pages_skipped,
 		  __entry->range_start, __entry->range_end,
-		  __entry->nonblocking, __entry->for_kupdate,
+		  __entry->for_kupdate,
 		  __entry->for_reclaim, __entry->range_cyclic,
 		  (unsigned long)__entry->writeback_index)
 );
--- linux-next.orig/fs/nfs/write.c	2011-05-06 00:35:11.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-05-06 00:35:14.000000000 +0800
@@ -1562,8 +1562,7 @@ int nfs_write_inode(struct inode *inode,
 		int status;
 		bool sync = true;
 
-		if (wbc->sync_mode == WB_SYNC_NONE || wbc->nonblocking ||
-		    wbc->for_background)
+		if (wbc->sync_mode == WB_SYNC_NONE)
 			sync = false;
 
 		status = pnfs_layoutcommit_inode(inode, sync);
--- linux-next.orig/include/linux/writeback.h	2011-05-06 00:35:14.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-06 00:35:14.000000000 +0800
@@ -36,8 +36,6 @@ struct writeback_control {
 	loff_t range_start;
 	loff_t range_end;
 
-	unsigned nonblocking:1;		/* Don't get stuck on request queues */
-	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
 	unsigned tagged_sync:1;		/* do livelock prevention for sync */



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 16/17] writeback: trace event writeback_single_inode
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (14 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 15/17] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-06  4:16   ` [PATCH 16/17] writeback: trace event writeback_single_inode (v2) Wu Fengguang
  2011-05-06  3:08 ` [PATCH 17/17] writeback: trace event writeback_queue_io Wu Fengguang
  2011-05-06  4:06 ` [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Anca Emanuel
  17 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/plain, Size: 3158 bytes --]

It is valuable to know how the dirty inodes are iterated over and what their IO sizes are.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 wrote=0 to_write=1024"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "wrote" is the number of pages written in this writeback_single_inode()
- "to_write" is the remained wbc->nr_to_write

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    3 +
 include/trace/events/writeback.h |   56 +++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 1 deletion(-)

--- linux-next.orig/include/trace/events/writeback.h	2011-05-05 23:30:34.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-05 23:30:37.000000000 +0800
@@ -8,6 +8,19 @@
 #include <linux/device.h>
 #include <linux/writeback.h>
 
+#define show_inode_state(state)					\
+	__print_flags(state, "|",				\
+		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
+		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
+		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"},	\
+		{I_NEW,			"I_NEW"},		\
+		{I_WILL_FREE,		"I_WILL_FREE"},		\
+		{I_FREEING,		"I_FREEING"},		\
+		{I_CLEAR,		"I_CLEAR"},		\
+		{I_SYNC,		"I_SYNC"},		\
+		{I_REFERENCED,		"I_REFERENCED"}		\
+	)
+
 struct wb_writeback_work;
 
 DECLARE_EVENT_CLASS(writeback_work_class,
@@ -180,6 +193,49 @@ DEFINE_EVENT(writeback_congest_waited_te
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
+TRACE_EVENT(writeback_single_inode,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long wrote
+	),
+
+	TP_ARGS(inode, wbc, wrote),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, age)
+		__field(unsigned long, writeback_index)
+		__field(unsigned long, wrote)
+		__field(long, nr_to_write)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->age		= (jiffies - inode->dirtied_when) *
+								1000 / HZ;
+		__entry->writeback_index = inode->i_mapping->writeback_index;
+		__entry->wrote		= wrote;
+		__entry->nr_to_write	= wbc->nr_to_write;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+		  "index=%lu wrote=%lu to_write=%ld",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->age,
+		  __entry->writeback_index,
+		  __entry->wrote,
+		  __entry->nr_to_write
+	)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:34.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:37.000000000 +0800
@@ -594,7 +594,8 @@ static long writeback_sb_inodes(struct s
 		wbc.pages_skipped = 0;
 
 		writeback_single_inode(inode, wb, &wbc);
-
+		trace_writeback_single_inode(inode, &wbc,
+					     write_chunk - wbc.nr_to_write);
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
 		if (wbc.pages_skipped) {



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 17/17] writeback: trace event writeback_queue_io
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (15 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 16/17] writeback: trace event writeback_single_inode Wu Fengguang
@ 2011-05-06  3:08 ` Wu Fengguang
  2011-05-06  4:06 ` [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Anca Emanuel
  17 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

[-- Attachment #1: writeback-trace-queue_io.patch --]
[-- Type: text/plain, Size: 3503 bytes --]

Note that it adds a little overhead to account for the inodes
moved/enqueued from b_dirty to b_io. The "moved" count may later be
used to limit the number of inodes that can be moved in one shot, in
order to keep the spinlock hold time under control.
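
A hypothetical sample line (the values are invented for illustration;
the format follows the TP_printk() in the diff below):

	writeback_queue_io: bdi 8:0: older=0 age=-1 enqueue=12

Here older=0/age=-1 correspond to a NULL older_than_this (e.g. sync),
and enqueue counts the inodes moved from b_dirty to b_io by this
queue_io() call.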

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   14 ++++++++++----
 include/trace/events/writeback.h |   25 +++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 4 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:53:33.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:53:34.000000000 +0800
@@ -258,15 +258,16 @@ static bool inode_dirtied_after(struct i
 /*
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
-static void move_expired_inodes(struct list_head *delaying_queue,
+static int move_expired_inodes(struct list_head *delaying_queue,
 			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+			       unsigned long *older_than_this)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
+	int moved = 0;
 
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
@@ -277,12 +278,13 @@ static void move_expired_inodes(struct l
 			do_sb_sort = 1;
 		sb = inode->i_sb;
 		list_move(&inode->i_wb_list, &tmp);
+		moved++;
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
 	if (!do_sb_sort) {
 		list_splice(&tmp, dispatch_queue);
-		return;
+		goto out;
 	}
 
 	/* Move inodes from one superblock together */
@@ -294,6 +296,8 @@ static void move_expired_inodes(struct l
 				list_move(&inode->i_wb_list, dispatch_queue);
 		}
 	}
+out:
+	return moved;
 }
 
 /*
@@ -309,9 +313,11 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
+	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	trace_writeback_queue_io(wb, older_than_this, moved);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
--- linux-next.orig/include/trace/events/writeback.h	2011-05-05 23:53:33.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-06 00:14:09.000000000 +0800
@@ -158,6 +158,31 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt
 DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_queue_io,
+	TP_PROTO(struct bdi_writeback *wb,
+		 unsigned long *older_than_this,
+		 int moved),
+	TP_ARGS(wb, older_than_this, moved),
+	TP_STRUCT__entry(
+		__array(char,		name, 32)
+		__field(unsigned long,	older)
+		__field(long,		age)
+		__field(int,		moved)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+		__entry->older	= older_than_this ?  *older_than_this : 0;
+		__entry->age	= older_than_this ?
+				  (jiffies - *older_than_this) * 1000 / HZ : -1;
+		__entry->moved	= moved;
+	),
+	TP_printk("bdi %s: older=%lu age=%ld enqueue=%d",
+		__entry->name,
+		__entry->older,	/* older_than_this in jiffies */
+		__entry->age,	/* older_than_this in relative milliseconds */
+		__entry->moved)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] writeback fixes and cleanups for 2.6.40
  2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
                   ` (16 preceding siblings ...)
  2011-05-06  3:08 ` [PATCH 17/17] writeback: trace event writeback_queue_io Wu Fengguang
@ 2011-05-06  4:06 ` Anca Emanuel
  2011-05-06  4:09   ` Wu Fengguang
  17 siblings, 1 reply; 42+ messages in thread
From: Anca Emanuel @ 2011-05-06  4:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri, May 6, 2011 at 6:08 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> The patches are also git pullable from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback

git pull git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git
fs-writeback
fatal: Couldn't find remote ref fs-writeback


git ls-remote git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git
5895198c56d131cc696556a45f7ff0ea99ac297b	HEAD
dbe30770c0b9d280742d3ef0c12b8187ce6e19a7	refs/heads/dirty-throttling-v2
a56422a5103f57066e401e1d506f1e0cadbe3ad7	refs/heads/dirty-throttling-v3
1fe1fbdcbcd2bc1a0034ec84a927222b3278dca9	refs/heads/dirty-throttling-v4
6b7ca194675e2fb4aaf4394df3aedfda69ab14a4	refs/heads/dirty-throttling-v5
5f35306e8d2e56dd6973556a5b844e4eb8ce24c4	refs/heads/dirty-throttling-v6
d0e30163e390d87387ec13e3b1c2168238c26793	refs/heads/dirty-throttling-v7
5895198c56d131cc696556a45f7ff0ea99ac297b	refs/heads/master
...

no branch fs-writeback.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] writeback fixes and cleanups for 2.6.40
  2011-05-06  4:06 ` [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Anca Emanuel
@ 2011-05-06  4:09   ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  4:09 UTC (permalink / raw)
  To: Anca Emanuel
  Cc: Andrew Morton, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri, May 06, 2011 at 12:06:42PM +0800, Anca Emanuel wrote:
> On Fri, May 6, 2011 at 6:08 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > The patches are also git pullable from
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback
> 
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git
> fs-writeback
> fatal: Couldn't find remote ref fs-writeback
> 
> 
> git ls-remote git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git
> 5895198c56d131cc696556a45f7ff0ea99ac297b	HEAD
> dbe30770c0b9d280742d3ef0c12b8187ce6e19a7	refs/heads/dirty-throttling-v2
> a56422a5103f57066e401e1d506f1e0cadbe3ad7	refs/heads/dirty-throttling-v3
> 1fe1fbdcbcd2bc1a0034ec84a927222b3278dca9	refs/heads/dirty-throttling-v4
> 6b7ca194675e2fb4aaf4394df3aedfda69ab14a4	refs/heads/dirty-throttling-v5
> 5f35306e8d2e56dd6973556a5b844e4eb8ce24c4	refs/heads/dirty-throttling-v6
> d0e30163e390d87387ec13e3b1c2168238c26793	refs/heads/dirty-throttling-v7
> 5895198c56d131cc696556a45f7ff0ea99ac297b	refs/heads/master
> ...
> 
> no branch fs-writeback.

It seems that it takes quite some time for the new branch to show up
publicly..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 16/17] writeback: trace event writeback_single_inode (v2)
  2011-05-06  3:08 ` [PATCH 16/17] writeback: trace event writeback_single_inode Wu Fengguang
@ 2011-05-06  4:16   ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-06  4:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel

[changes:
- move trace point into writeback_single_inode() to cover the other 2 callers;
- change format to show the original rather than the remaining wbc->nr_to_write
]

It is valuable to know how the dirty inodes are iterated over and what their IO sizes are.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 wrote=0 to_write=1024"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    6 ++-
 include/trace/events/writeback.h |   56 +++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 1 deletion(-)

--- linux-next.orig/include/trace/events/writeback.h	2011-05-06 12:10:16.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-06 12:13:04.000000000 +0800
@@ -8,6 +8,19 @@
 #include <linux/device.h>
 #include <linux/writeback.h>
 
+#define show_inode_state(state)					\
+	__print_flags(state, "|",				\
+		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
+		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
+		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"},	\
+		{I_NEW,			"I_NEW"},		\
+		{I_WILL_FREE,		"I_WILL_FREE"},		\
+		{I_FREEING,		"I_FREEING"},		\
+		{I_CLEAR,		"I_CLEAR"},		\
+		{I_SYNC,		"I_SYNC"},		\
+		{I_REFERENCED,		"I_REFERENCED"}		\
+	)
+
 struct wb_writeback_work;
 
 DECLARE_EVENT_CLASS(writeback_work_class,
@@ -180,6 +193,49 @@ DEFINE_EVENT(writeback_congest_waited_te
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
+TRACE_EVENT(writeback_single_inode,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write
+	),
+
+	TP_ARGS(inode, wbc, nr_to_write),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, age)
+		__field(unsigned long, writeback_index)
+		__field(long, nr_to_write)
+		__field(unsigned long, wrote)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->age		= (jiffies - inode->dirtied_when) *
+								1000 / HZ;
+		__entry->writeback_index = inode->i_mapping->writeback_index;
+		__entry->nr_to_write	= nr_to_write;
+		__entry->wrote		= nr_to_write - wbc->nr_to_write;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+		  "index=%lu to_write=%ld wrote=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->age,
+		  __entry->writeback_index,
+		  __entry->nr_to_write,
+		  __entry->wrote
+	)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
--- linux-next.orig/fs/fs-writeback.c	2011-05-06 12:10:16.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-06 12:11:48.000000000 +0800
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *ino
 		       struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -378,7 +379,8 @@ writeback_single_inode(struct inode *ino
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
 			requeue_io(inode, wb);
-			return 0;
+			ret = 0;
+			goto out;
 		}
 
 		/*
@@ -476,6 +478,8 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+out:
+	trace_writeback_single_inode(inode, wbc, nr_to_write);
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned
  2011-05-06  3:08 ` [PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned Wu Fengguang
@ 2011-05-06 14:36   ` Jan Kara
  2011-05-10  2:23     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2011-05-06 14:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, linux-fsdevel

On Fri 06-05-11 11:08:24, Wu Fengguang wrote:
> The flusher works on dirty inodes in batches, and may quit prematurely
> if the batch of inodes happen to be metadata-only dirtied: in this case
> wbc->nr_to_write won't be decreased at all, which stands for "no pages
> written" but also mis-interpreted as "no progress".
> 
> So introduce writeback_control.inodes_cleaned to count the inodes get
> cleaned.  A non-zero value means there are some progress on writeback,
> in which case more writeback can be tried.
> 
> about v1: The initial version was to count successful ->write_inode()
> calls.  However it leads to busy loops for sync() over NFS, because NFS
> ridiculously returns 0 (success) while at the same time redirties the
> inode.  The NFS case can be trivially fixed, however there may be more
> hidden bugs in other filesystems..
  I think Christoph didn't like this patch
(https://lkml.org/lkml/2011/5/4/123) and suggested that inodes_cleaned
should remain local to fs-writeback.c...

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/17] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage
  2011-05-06  3:08 ` [PATCH 01/17] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage Wu Fengguang
@ 2011-05-06 16:08   ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2011-05-06 16:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri 06-05-11 11:08:22, Wu Fengguang wrote:
> sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> livelock prevention for it, too.
> 
> Note that writeback_inodes_sb() is called by not only sync(), they are
> treated the same because the other callers also need livelock prevention.
> 
> Impact:  It changes the order in which pages/inodes are synced to disk.
> Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
> until finished with the current inode.
  You can add:
Acked-by: Jan Kara <jack@suse.cz>

> CC: Jan Kara <jack@suse.cz>
> CC: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/ext4/inode.c           |    4 ++--
>  fs/fs-writeback.c         |    9 +++++----
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    4 ++--
>  4 files changed, 10 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:29:51.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:22.000000000 +0800
> @@ -36,6 +36,7 @@ struct wb_writeback_work {
>  	long nr_pages;
>  	struct super_block *sb;
>  	enum writeback_sync_modes sync_mode;
> +	unsigned int tagged_sync:1;
>  	unsigned int for_kupdate:1;
>  	unsigned int range_cyclic:1;
>  	unsigned int for_background:1;
> @@ -650,6 +651,7 @@ static long wb_writeback(struct bdi_writ
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> +		.tagged_sync		= work->tagged_sync,
>  		.older_than_this	= NULL,
>  		.for_kupdate		= work->for_kupdate,
>  		.for_background		= work->for_background,
> @@ -657,7 +659,7 @@ static long wb_writeback(struct bdi_writ
>  	};
>  	unsigned long oldest_jif;
>  	long wrote = 0;
> -	long write_chunk;
> +	long write_chunk = MAX_WRITEBACK_PAGES;
>  	struct inode *inode;
>  
>  	if (wbc.for_kupdate) {
> @@ -683,9 +685,7 @@ static long wb_writeback(struct bdi_writ
>  	 *                   (quickly) tag currently dirty pages
>  	 *                   (maybe slowly) sync all tagged pages
>  	 */
> -	if (wbc.sync_mode == WB_SYNC_NONE)
> -		write_chunk = MAX_WRITEBACK_PAGES;
> -	else
> +	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
>  		write_chunk = LONG_MAX;
>  
>  	wbc.wb_start = jiffies; /* livelock avoidance */
> @@ -1193,6 +1193,7 @@ void writeback_inodes_sb_nr(struct super
>  	struct wb_writeback_work work = {
>  		.sb		= sb,
>  		.sync_mode	= WB_SYNC_NONE,
> +		.tagged_sync	= 1,
>  		.done		= &done,
>  		.nr_pages	= nr,
>  	};
> --- linux-next.orig/include/linux/writeback.h	2011-05-05 23:29:51.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-05-05 23:30:22.000000000 +0800
> @@ -47,6 +47,7 @@ struct writeback_control {
>  	unsigned encountered_congestion:1; /* An output: a queue is full */
>  	unsigned for_kupdate:1;		/* A kupdate writeback */
>  	unsigned for_background:1;	/* A background writeback */
> +	unsigned tagged_sync:1;		/* do livelock prevention for sync */
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
>  	unsigned more_io:1;		/* more io to be dispatched */
> --- linux-next.orig/mm/page-writeback.c	2011-05-05 23:29:51.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-05-05 23:30:22.000000000 +0800
> @@ -892,12 +892,12 @@ int write_cache_pages(struct address_spa
>  			range_whole = 1;
>  		cycled = 1; /* ignore range_cyclic tests */
>  	}
> -	if (wbc->sync_mode == WB_SYNC_ALL)
> +	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
>  		tag = PAGECACHE_TAG_TOWRITE;
>  	else
>  		tag = PAGECACHE_TAG_DIRTY;
>  retry:
> -	if (wbc->sync_mode == WB_SYNC_ALL)
> +	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
>  		tag_pages_for_writeback(mapping, index, end);
>  	done_index = index;
>  	while (!done && (index <= end)) {
> --- linux-next.orig/fs/ext4/inode.c	2011-05-05 23:29:51.000000000 +0800
> +++ linux-next/fs/ext4/inode.c	2011-05-05 23:30:22.000000000 +0800
> @@ -2741,7 +2741,7 @@ static int write_cache_pages_da(struct a
>  	index = wbc->range_start >> PAGE_CACHE_SHIFT;
>  	end = wbc->range_end >> PAGE_CACHE_SHIFT;
>  
> -	if (wbc->sync_mode == WB_SYNC_ALL)
> +	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
>  		tag = PAGECACHE_TAG_TOWRITE;
>  	else
>  		tag = PAGECACHE_TAG_DIRTY;
> @@ -2975,7 +2975,7 @@ static int ext4_da_writepages(struct add
>  	}
>  
>  retry:
> -	if (wbc->sync_mode == WB_SYNC_ALL)
> +	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
>  		tag_pages_for_writeback(mapping, index, end);
>  
>  	while (!ret && wbc->nr_to_write > 0) {
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/17] writeback: update dirtied_when for synced inode to prevent livelock
  2011-05-06  3:08 ` [PATCH 02/17] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
@ 2011-05-06 16:33   ` Jan Kara
  2011-05-10  2:14     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2011-05-06 16:33 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri 06-05-11 11:08:23, Wu Fengguang wrote:
> With the more aggressive "keep writeback as long as we wrote something"
> logic in wb_writeback(), the "use LONG_MAX .nr_to_write" trick in commit
> b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") is
> no longer enough to stop sync livelock.
> 
> The fix is to explicitly update .dirtied_when on synced inodes, so that
> they are no longer considered for writeback in the next round.
  The changelog doesn't make sense now that the patch is in second
place in the series... Also, as we discussed in the previous iteration of
the patches, I thought you'd move the condition before the
mapping_tagged() test. I.e. the code would be (comment updated):
	if (!(inode->i_state & I_FREEING)) {
		/*
		 * Sync livelock prevention: Each inode is tagged and
		 * synced in one shot. So we can unconditionally update its
		 * dirty time to prevent syncing it again. Note that time
		 * ordering of b_dirty list will be kept because the
		 * following code either removes the inode from b_dirty
		 * or calls redirty_tail().
		 */
		if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
			inode->dirtied_when = jiffies;
		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
...

								Honza
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> ext3/ext4 are working fine now, however tests show that XFS may still
> livelock inside the XFS routines:
> 
> [ 3581.181253] sync            D ffff8800b6ca15d8  4560  4403   4392 0x00000000
> [ 3581.181734]  ffff88006f775bc8 0000000000000046 ffff8800b6ca12b8 00000001b6ca1938
> [ 3581.182411]  ffff88006f774000 00000000001d2e40 00000000001d2e40 ffff8800b6ca1280
> [ 3581.183088]  00000000001d2e40 ffff88006f775fd8 00000340af111ef2 00000000001d2e40
> [ 3581.183765] Call Trace:
> [ 3581.184008]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
> [ 3581.184392]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
> [ 3581.184756]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
> [ 3581.185120]  [<ffffffff812ed520>] xfs_ioend_wait+0x87/0x9f
> [ 3581.185474]  [<ffffffff8108c97a>] ? wake_up_bit+0x2a/0x2a
> [ 3581.185827]  [<ffffffff812f742a>] xfs_sync_inode_data+0x92/0x9d
> [ 3581.186198]  [<ffffffff812f76e2>] xfs_inode_ag_walk+0x1a5/0x287
> [ 3581.186569]  [<ffffffff812f779b>] ? xfs_inode_ag_walk+0x25e/0x287
> [ 3581.186946]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
> [ 3581.187311]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
> [ 3581.187669]  [<ffffffff81092175>] ? local_clock+0x41/0x5a
> [ 3581.188020]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
> [ 3581.188403]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
> [ 3581.188773]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
> [ 3581.189130]  [<ffffffff812e236c>] ? xfs_perag_get+0x80/0xd0
> [ 3581.189488]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
> [ 3581.189858]  [<ffffffff812f7831>] ? xfs_inode_ag_iterator+0x6d/0x8f
> [ 3581.190241]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
> [ 3581.190606]  [<ffffffff812f780b>] xfs_inode_ag_iterator+0x47/0x8f
> [ 3581.190982]  [<ffffffff811611f5>] ? __sync_filesystem+0x7a/0x7a
> [ 3581.191352]  [<ffffffff812f7877>] xfs_sync_data+0x24/0x43
> [ 3581.191703]  [<ffffffff812f7911>] xfs_quiesce_data+0x2c/0x88
> [ 3581.192065]  [<ffffffff812f5556>] xfs_fs_sync_fs+0x21/0x48
> [ 3581.192419]  [<ffffffff811611e1>] __sync_filesystem+0x66/0x7a
> [ 3581.192783]  [<ffffffff8116120b>] sync_one_sb+0x16/0x18
> [ 3581.193128]  [<ffffffff8113e3e3>] iterate_supers+0x72/0xce
> [ 3581.193482]  [<ffffffff81161140>] sync_filesystems+0x20/0x22
> [ 3581.193842]  [<ffffffff8116127e>] sys_sync+0x21/0x33
> [ 3581.194177]  [<ffffffff819016c2>] system_call_fastpath+0x16/0x1b
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:23.000000000 +0800
> @@ -432,6 +432,15 @@ writeback_single_inode(struct inode *ino
>  				requeue_io(inode);
>  			} else {
>  				/*
> +				 * sync livelock prevention: each inode is
> +				 * tagged and synced in one shot. If still
> +				 * dirty, move it back to s_dirty with updated
> +				 * dirty time to prevent syncing it again.
> +				 */
> +				if (wbc->sync_mode == WB_SYNC_ALL ||
> +				    wbc->tagged_sync)
> +					inode->dirtied_when = jiffies;
> +				/*
>  				 * Writeback blocked by something other than
>  				 * congestion. Delay the inode for some time to
>  				 * avoid spinning on the CPU (100% iowait)
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/17] writeback: sync expired inodes first in background writeback
  2011-05-06  3:08 ` [PATCH 06/17] writeback: sync expired inodes first in background writeback Wu Fengguang
@ 2011-05-06 19:02   ` Rik van Riel
  2011-05-09 16:08   ` Jan Kara
  1 sibling, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2011-05-06 19:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Dave Chinner, Jan Kara, Mel Gorman,
	Christoph Hellwig, linux-fsdevel

On 05/05/2011 11:08 PM, Wu Fengguang wrote:
> A background flush work may run forever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
>
> At each queue_io() time, first try enqueuing only newly expired inodes.
> If there are zero expired inodes to work with, then relax the rule and
> enqueue all dirty inodes.
>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/17] writeback: try more writeback as long as something was written
  2011-05-06  3:08 ` [PATCH 04/17] writeback: try more writeback as long as something was written Wu Fengguang
@ 2011-05-09 16:05   ` Jan Kara
  2011-05-10  2:40     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2011-05-09 16:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri 06-05-11 11:08:25, Wu Fengguang wrote:
> writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> they only populate b_io with possibly a subset of the eligible inodes
> at entrance time. When the queued set of inodes is all synced, they just
> return, possibly with all queued inode pages written but still
> wbc.nr_to_write > 0.
> 
> For kupdate and background writeback, there may be more eligible inodes
> sitting in b_dirty when the current set of b_io inodes are completed. So
> it is necessary to try another round of writeback as long as we made some
> progress in this round. When there are no more eligible inodes, no more
> inodes will be enqueued in queue_io(), hence nothing could/will be
> synced and we may safely bail.
> 
> For example, imagine 100 inodes
> 
>         i0, i1, i2, ..., i90, i91, ..., i99
> 
> At queue_io() time, i90-i99 happen to be expired and moved to s_io for
> IO. When finished successfully, if their total size is less than
> MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> quit the background work (w/o this patch) while it's still over
> background threshold. This will be a fairly normal/frequent case I guess.
> 
> Jan raised the concern
> 
> 	I'm just afraid that in some pathological cases this could
> 	result in bad writeback pattern - like if there is some process
> 	which manages to dirty just a few pages while we are doing
> 	writeout, this looping could result in writing just a few pages
> 	in each round which is bad for fragmentation etc.
> 
> However it requires really strong timing to make that happen
> (continuously).  In practice it's very hard to produce such a pattern even if
> there is such a possibility in theory. I actually tried to write 1 page
> per 1ms with this command
> 
> 	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
> 
> and do sync(1) at the same time. The sync completes quickly on ext4,
> xfs, btrfs. The readers could try other write-and-sleep patterns and
> check if it can block sync for longer time.
  After some thought I realized that i_dirtied_when is going to be updated
in these cases and so we stop writing back the inode soon. So I think we
should be fine in the end. You can add:
Acked-by: Jan Kara <jack@suse.cz>

								Honza
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:24.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:25.000000000 +0800
> @@ -739,23 +739,23 @@ static long wb_writeback(struct bdi_writ
>  		wrote += write_chunk - wbc.nr_to_write;
>  
>  		/*
> -		 * If we consumed everything, see if we have more
> +		 * Did we write something? Try for more
> +		 *
> +		 * Dirty inodes are moved to b_io for writeback in batches.
> +		 * The completion of the current batch does not necessarily
> +		 * mean the overall work is done. So we keep looping as long
> +		 * as we made some progress on cleaning pages or inodes.
>  		 */
> -		if (wbc.nr_to_write <= 0)
> +		if (wbc.nr_to_write < write_chunk)
>  			continue;
>  		if (wbc.inodes_cleaned)
>  			continue;
>  		/*
> -		 * Didn't write everything and we don't have more IO, bail
> +		 * No more inodes for IO, bail
>  		 */
>  		if (!wbc.more_io)
>  			break;
>  		/*
> -		 * Did we write something? Try for more
> -		 */
> -		if (wbc.nr_to_write < write_chunk)
> -			continue;
> -		/*
>  		 * Nothing written. Wait for some inode to
>  		 * become available for writeback. Otherwise
>  		 * we'll just busyloop.
> 
> 

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/17] writeback: sync expired inodes first in background writeback
  2011-05-06  3:08 ` [PATCH 06/17] writeback: sync expired inodes first in background writeback Wu Fengguang
  2011-05-06 19:02   ` Rik van Riel
@ 2011-05-09 16:08   ` Jan Kara
  2011-05-09 16:18     ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Jan Kara @ 2011-05-09 16:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Dave Chinner, Jan Kara, Rik van Riel,
	Mel Gorman, Christoph Hellwig, linux-fsdevel

On Fri 06-05-11 11:08:27, Wu Fengguang wrote:
> A background flush work may run forever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> At each queue_io() time, first try enqueuing only newly expired inodes.
> If there are zero expired inodes to work with, then relax the rule and
> enqueue all dirty inodes.
> 
> This will help reduce the number of dirty pages encountered by page
> reclaim, e.g. the pageout() calls. Normally older inodes contain older
> dirty pages, which are closer to the end of the LRU lists. So
> syncing older inodes first helps reduce the dirty pages reached by
> the page reclaim code.
> 
> More background: as Mel put it, "it makes sense to write old pages first
> to reduce the chances page reclaim is initiating IO."
> 
> Rik also presented the situation with a graph:
> 
> LRU head                                 [*] dirty page
> [                          *              *      * *  *  * * * * * *]
> 
> Ideally, most dirty pages should lie close to the LRU tail instead of
> LRU head. That requires the flusher thread to sync old/expired inodes
> first (as there are obvious correlations between inode age and page
> age), and to give fair opportunities to newly expired inodes rather
> than sticking with some large eldest inodes (as larger inodes have
> weaker correlations in the inode<=>page ages).
> 
> This patch helps the flusher to meet both the above requirements.
> 
> Side effects: it might reduce the batch size and hence reduce
> inode_wb_list_lock hold time, but in turn make the cluster-by-partition
> logic in the same function less effective at reducing disk seeks.
> 
> v2: keep policy changes inside wb_writeback() and keep the
> wbc.older_than_this visibility as suggested by Dave.
  The age of pages in the LRU is not necessarily related to the
i_dirtied_when time stamp, so I'm not sure how much this will help after
all, but it makes some sense from the data integrity point of view at least. You
can add:
Acked-by: Jan Kara <jack@suse.cz>

								Honza
> 
> CC: Dave Chinner <david@fromorbit.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:25.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:26.000000000 +0800
> @@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
>  		if (work->for_background && !over_bground_thresh())
>  			break;
>  
> -		if (work->for_kupdate) {
> +		if (work->for_kupdate || work->for_background) {
>  			oldest_jif = jiffies -
>  				msecs_to_jiffies(dirty_expire_interval * 10);
>  			wbc.older_than_this = &oldest_jif;
> @@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
>  		wbc.pages_skipped = 0;
>  		wbc.inodes_cleaned = 0;
>  
> +retry:
>  		trace_wbc_writeback_start(&wbc, wb->bdi);
>  		if (work->sb)
>  			__writeback_inodes_sb(work->sb, wb, &wbc);
> @@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
>  		if (wbc.inodes_cleaned)
>  			continue;
>  		/*
> +		 * background writeback will start with expired inodes, and
> +		 * if none is found, fall back to all inodes. This order helps
> +		 * reduce the number of dirty pages reaching the end of the LRU
> +		 * lists and causing trouble for page reclaim.
> +		 */
> +		if (work->for_background &&
> +		    wbc.older_than_this &&
> +		    list_empty(&wb->b_io) &&
> +		    list_empty(&wb->b_more_io)) {
> +			wbc.older_than_this = NULL;
> +			goto retry;
> +		}
> +		/*
>  		 * No more inodes for IO, bail
>  		 */
>  		if (!wbc.more_io)
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/17] writeback: elevate queue_io() into wb_writeback()
  2011-05-06  3:08 ` [PATCH 09/17] writeback: elevate queue_io() into wb_writeback() Wu Fengguang
@ 2011-05-09 16:15   ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2011-05-09 16:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri 06-05-11 11:08:30, Wu Fengguang wrote:
> Code refactor for more logical code layout.
> No behavior change.
> 
> - remove the mis-named __writeback_inodes_sb()
> 
> - wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
>   before calling __writeback_inodes_wb()
> 
  You can add:
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   27 ++++++++++++---------------
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:41:36.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-05 23:51:59.000000000 +0800
> @@ -580,17 +580,13 @@ static int writeback_sb_inodes(struct su
>  	return 1;
>  }
>  
> -void writeback_inodes_wb(struct bdi_writeback *wb,
> -		struct writeback_control *wbc)
> +static void __writeback_inodes_wb(struct bdi_writeback *wb,
> +				  struct writeback_control *wbc)
>  {
>  	int ret = 0;
>  
>  	if (!wbc->wb_start)
>  		wbc->wb_start = jiffies; /* livelock avoidance */
> -	spin_lock(&wb->list_lock);
> -
> -	if (list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -606,19 +602,16 @@ void writeback_inodes_wb(struct bdi_writ
>  		if (ret)
>  			break;
>  	}
> -	spin_unlock(&wb->list_lock);
>  	/* Leave any unwritten inodes on b_io */
>  }
>  
> -static void __writeback_inodes_sb(struct super_block *sb,
> -		struct bdi_writeback *wb, struct writeback_control *wbc)
> +void writeback_inodes_wb(struct bdi_writeback *wb,
> +		struct writeback_control *wbc)
>  {
> -	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> -
>  	spin_lock(&wb->list_lock);
>  	if (list_empty(&wb->b_io))
>  		queue_io(wb, wbc->older_than_this);
> -	writeback_sb_inodes(sb, wb, wbc, true);
> +	__writeback_inodes_wb(wb, wbc);
>  	spin_unlock(&wb->list_lock);
>  }
>  
> @@ -685,7 +678,7 @@ static long wb_writeback(struct bdi_writ
>  	 * The intended call sequence for WB_SYNC_ALL writeback is:
>  	 *
>  	 *      wb_writeback()
> -	 *          __writeback_inodes_sb()     <== called only once
> +	 *          writeback_sb_inodes()       <== called only once
>  	 *              write_cache_pages()     <== called once for each inode
>  	 *                   (quickly) tag currently dirty pages
>  	 *                   (maybe slowly) sync all tagged pages
> @@ -731,10 +724,14 @@ static long wb_writeback(struct bdi_writ
>  
>  retry:
>  		trace_wbc_writeback_start(&wbc, wb->bdi);
> +		spin_lock(&wb->list_lock);
> +		if (list_empty(&wb->b_io))
> +			queue_io(wb, wbc.older_than_this);
>  		if (work->sb)
> -			__writeback_inodes_sb(work->sb, wb, &wbc);
> +			writeback_sb_inodes(work->sb, wb, &wbc, true);
>  		else
> -			writeback_inodes_wb(wb, &wbc);
> +			__writeback_inodes_wb(wb, &wbc);
> +		spin_unlock(&wb->list_lock);
>  		trace_wbc_writeback_written(&wbc, wb->bdi);
>  
>  		work->nr_pages -= write_chunk - wbc.nr_to_write;
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/17] writeback: avoid extra sync work at enqueue time
  2011-05-06  3:08 ` [PATCH 10/17] writeback: avoid extra sync work at enqueue time Wu Fengguang
@ 2011-05-09 16:16   ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2011-05-09 16:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri 06-05-11 11:08:31, Wu Fengguang wrote:
> This removes writeback_control.wb_start and does more straightforward
> sync livelock prevention by setting .older_than_this to prevent extra
> inodes from being enqueued in the first place.
> 
> CC: Jan Kara <jack@suse.cz>
  Acked-by: Jan Kara <jack@suse.cz>

								Honza

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c         |   16 +++-------------
>  include/linux/writeback.h |    3 ---
>  2 files changed, 3 insertions(+), 16 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:29.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:30.000000000 +0800
> @@ -544,15 +544,6 @@ static int writeback_sb_inodes(struct su
>  			continue;
>  		}
>  
> -		/*
> -		 * Was this inode dirtied after sync_sb_inodes was called?
> -		 * This keeps sync from extra jobs and livelock.
> -		 */
> -		if (inode_dirtied_after(inode, wbc->wb_start)) {
> -			spin_unlock(&inode->i_lock);
> -			return 1;
> -		}
> -
>  		__iget(inode);
>  
>  		pages_skipped = wbc->pages_skipped;
> @@ -585,9 +576,6 @@ static void __writeback_inodes_wb(struct
>  {
>  	int ret = 0;
>  
> -	if (!wbc->wb_start)
> -		wbc->wb_start = jiffies; /* livelock avoidance */
> -
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
>  		struct super_block *sb = inode->i_sb;
> @@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
>  	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
>  		write_chunk = LONG_MAX;
>  
> -	wbc.wb_start = jiffies; /* livelock avoidance */
> +	oldest_jif = jiffies;
> +	wbc.older_than_this = &oldest_jif;
> +
>  	for (;;) {
>  		/*
>  		 * Stop writeback when nr_pages has been consumed
> --- linux-next.orig/include/linux/writeback.h	2011-05-05 23:30:28.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-05-05 23:30:30.000000000 +0800
> @@ -26,9 +26,6 @@ struct writeback_control {
>  	enum writeback_sync_modes sync_mode;
>  	unsigned long *older_than_this;	/* If !NULL, only write back inodes
>  					   older than this */
> -	unsigned long wb_start;         /* Time writeback_inodes_wb was
> -					   called. This is needed to avoid
> -					   extra jobs and livelock */
>  	long nr_to_write;		/* Write this many pages, and decrement
>  					   this for each page written */
>  	long pages_skipped;		/* Pages which were not written */
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/17] writeback: sync expired inodes first in background writeback
  2011-05-09 16:08   ` Jan Kara
@ 2011-05-09 16:18     ` Rik van Riel
  2011-05-10  2:45       ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2011-05-09 16:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Andrew Morton, LKML, Dave Chinner, Mel Gorman,
	Christoph Hellwig, linux-fsdevel

On 05/09/2011 12:08 PM, Jan Kara wrote:

>    The age of pages in the LRU is not necessarily related to the
> i_dirtied_when time stamp, so I'm not sure how much this will help after all

Not necessarily, but given that the inactive file list is a
FIFO list, I expect there will be decent correlation.

> but it makes some sense from the data integrity point of view at least. You
> can add:
> Acked-by: Jan Kara<jack@suse.cz>

Good point on the data integrity.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 14/17] writeback: make writeback_control.nr_to_write straight
  2011-05-06  3:08 ` [PATCH 14/17] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-05-09 16:54   ` Jan Kara
  2011-05-10  3:19     ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2011-05-09 16:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri 06-05-11 11:08:35, Wu Fengguang wrote:
> Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
> and initialize the struct writeback_control there.
> 
> struct writeback_control is basically designed to control writeback of a
> single file, but we keep abusing it for writing multiple files in
> writeback_sb_inodes() and its callers.
> 
> It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
> work->nr_pages starts to make sense, and instead of saving and restoring
> pages_skipped in writeback_sb_inodes() it can always start with a clean
> zero value.
> 
> It also makes a neat IO pattern change: large dirty files are now
> written in the full 4MB writeback chunk size, rather than whatever
> quota remained in wbc->nr_to_write.
> 
> Proposed-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
...
> @@ -543,34 +588,40 @@ static int writeback_sb_inodes(struct su
>  			requeue_io(inode, wb);
>  			continue;
>  		}
> -
>  		__iget(inode);
> +		write_chunk = writeback_chunk_size(work);
> +		wbc.nr_to_write = write_chunk;
> +		wbc.pages_skipped = 0;
> +
> +		writeback_single_inode(inode, wb, &wbc);
>  
> -		pages_skipped = wbc->pages_skipped;
> -		writeback_single_inode(inode, wb, wbc);
> -		if (wbc->pages_skipped != pages_skipped) {
> +		work->nr_pages -= write_chunk - wbc.nr_to_write;
> +		wrote += write_chunk - wbc.nr_to_write;
> +		if (wbc.pages_skipped) {
>  			/*
>  			 * writeback is not making progress due to locked
>  			 * buffers.  Skip this inode for now.
>  			 */
>  			redirty_tail(inode, wb);
> -		}
> +		} else if (!(inode->i_state & I_DIRTY))
> +			wrote++;
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&wb->list_lock);
> -		if (wbc->nr_to_write <= 0)
> -			return 1;
> +		if (wrote >= MAX_WRITEBACK_PAGES)
> +			break;
  This definitely deserves a comment (as well as a similar check in
__writeback_inodes_wb()). I guess you bail out here so that we perform the
background limit check and the for_kupdate/for_background livelock checks
often enough. I'm undecided whether it's good to bail out like this. It's
not necessary in some cases (like WB_SYNC_ALL or for_sync writeback) but
OTOH moving the necessary checks here does not look ideal either...
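
(For context, the per-iteration termination checks in wb_writeback() that
this bail-out lets us re-evaluate are roughly

	if (work->nr_pages <= 0)
		break;
	if (work->for_background && !over_bground_thresh())
		break;

plus the for_kupdate expiry handling -- a sketch from the surrounding code,
not an exact quote.)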

>  void writeback_inodes_wb(struct bdi_writeback *wb,
> -		struct writeback_control *wbc)
> +			 struct writeback_control *wbc)
>  {
> +	struct wb_writeback_work work = {
> +		.nr_pages	= wbc->nr_to_write,
> +		.sync_mode	= wbc->sync_mode,
> +		.range_cyclic	= wbc->range_cyclic,
> +	};
> +
>  	spin_lock(&wb->list_lock);
>  	if (list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
> -	__writeback_inodes_wb(wb, wbc);
> +		queue_io(wb, NULL);
> +	__writeback_inodes_wb(wb, &work);
>  	spin_unlock(&wb->list_lock);
> -}
  Hmm, maybe we should just pass in the number of pages (similarly to
writeback_inodes_sb_nr())? It would look like a cleaner interface than
passing the whole writeback_control and then ignoring parts of it.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/17] writeback: update dirtied_when for synced inode to prevent livelock
  2011-05-06 16:33   ` Jan Kara
@ 2011-05-10  2:14     ` Wu Fengguang
  2011-05-10 12:05       ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-05-10  2:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, Dave Chinner, Christoph Hellwig, linux-fsdevel

On Sat, May 07, 2011 at 12:33:12AM +0800, Jan Kara wrote:
> On Fri 06-05-11 11:08:23, Wu Fengguang wrote:
> > With the more aggressive "keep writeback as long as we wrote something"
> > logic in wb_writeback(), the "use LONG_MAX .nr_to_write" trick in commit
> > b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") is
> > no longer enough to stop sync livelock.
> > 
> > The fix is to explicitly update .dirtied_when on synced inodes, so that
> > they are no longer considered for writeback in the next round.
>   The changelog doesn't make sense now when the patch is in the second
> place in the series... Also as we discussed in the previous iteration of
> the patches, I thought you'll move the condition before mapping_tagged()
> test. I.e. the code would be (comment updated):
> 	if (!(inode->i_state & I_FREEING)) {
> 		/*
> 		 * Sync livelock prevention: Each inode is tagged and
> 		 * synced in one shot. So we can unconditionally update its
> 		 * dirty time to prevent syncing it again. Note that time
> 		 * ordering of b_dirty list will be kept because the
> 		 * following code either removes the inode from b_dirty
> 		 * or calls redirty_tail().
> 		 */
> 		if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
> 			inode->dirtied_when = jiffies;
> 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {

It looks good, thanks!  Here is the updated patch with your
Reviewed-by. Note that I retained a modified paragraph in the changelog
to clarify its relationship to the other two patches.

Thanks,
Fengguang
---
Subject: writeback: update dirtied_when for synced inode to prevent livelock
Date: Wed Apr 27 19:05:21 CST 2011

Explicitly update .dirtied_when on synced inodes, so that they are no
longer considered for writeback in the next round.

We'll do more aggressive "keep writeback as long as we wrote something"
logic in wb_writeback(). Then the "use LONG_MAX .nr_to_write" trick in
commit b9543dac5b ("writeback: avoid livelocking WB_SYNC_ALL writeback")
will no longer be enough to stop sync livelock.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    9 +++++++++
 1 file changed, 9 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2011-05-10 09:50:07.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-10 10:03:00.000000000 +0800
@@ -419,6 +419,15 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
+		/*
+		 * Sync livelock prevention. Each inode is tagged and synced in
+		 * one shot, so we can unconditionally update its dirty time to
+		 * prevent syncing it again. Note that time ordering of b_dirty
+		 * list will be kept because the following code either removes
+		 * the inode from b_dirty or calls redirty_tail().
+		 */
+		if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
+			inode->dirtied_when = jiffies;
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned
  2011-05-06 14:36   ` Jan Kara
@ 2011-05-10  2:23     ` Wu Fengguang
  2011-05-10 13:52       ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-05-10  2:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, Mel Gorman, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Fri, May 06, 2011 at 10:36:14PM +0800, Jan Kara wrote:
> On Fri 06-05-11 11:08:24, Wu Fengguang wrote:
> > The flusher works on dirty inodes in batches, and may quit prematurely
> > if the batch of inodes happens to be metadata-only dirtied: in this case
> > wbc->nr_to_write won't be decreased at all, which stands for "no pages
> > written" but is also mis-interpreted as "no progress".
> > 
> > So introduce writeback_control.inodes_cleaned to count the inodes that
> > get cleaned.  A non-zero value means there is some progress on writeback,
> > in which case more writeback can be tried.
> > 
> > about v1: The initial version was to count successful ->write_inode()
> > calls.  However it leads to busy loops for sync() over NFS, because NFS
> > ridiculously returns 0 (success) while at the same time redirtying the
> > inode.  The NFS case can be trivially fixed, however there may be more
> > hidden bugs in other filesystems..
>   I think Christoph didn't like this patch
> (https://lkml.org/lkml/2011/5/4/123) and suggested that inodes_cleaned
> should remain local to fs-writeback.c...

Yes, he didn't like introducing one more field to writeback_control,
which will be removed in patch 14. (It will be a lot of fuss to move
patch 14 here or move this logic after patch 14.)
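
For reference, after patch 04 the loop in wb_writeback() treats both
counters as progress, roughly:

	if (wbc.nr_to_write < write_chunk)	/* wrote some pages */
		continue;
	if (wbc.inodes_cleaned)			/* cleaned some inodes */
		continue;
	if (!wbc.more_io)			/* no more inodes for IO */
		break;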

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/17] writeback: try more writeback as long as something was written
  2011-05-09 16:05   ` Jan Kara
@ 2011-05-10  2:40     ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-10  2:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, Dave Chinner, Christoph Hellwig, linux-fsdevel

On Tue, May 10, 2011 at 12:05:38AM +0800, Jan Kara wrote:
> On Fri 06-05-11 11:08:25, Wu Fengguang wrote:
> > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > they only populate b_io with possibly a subset of the eligible inodes
> > at entrance time. When the queued set of inodes is all synced, they just
> > return, possibly with all queued inode pages written but still
> > wbc.nr_to_write > 0.
> > 
> > For kupdate and background writeback, there may be more eligible inodes
> > sitting in b_dirty when the current set of b_io inodes are completed. So
> > it is necessary to try another round of writeback as long as we made some
> > progress in this round. When there are no more eligible inodes, no more
> > inodes will be enqueued in queue_io(), hence nothing could/will be
> > synced and we may safely bail.
> > 
> > For example, imagine 100 inodes
> > 
> >         i0, i1, i2, ..., i90, i91, ..., i99
> > 
> > At queue_io() time, i90-i99 happen to be expired and moved to s_io for
> > IO. When finished successfully, if their total size is less than
> > MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> > quit the background work (w/o this patch) while it's still over
> > background threshold. This will be a fairly normal/frequent case I guess.
> > 
> > Jan raised the concern
> > 
> > 	I'm just afraid that in some pathological cases this could
> > 	result in bad writeback pattern - like if there is some process
> > 	which manages to dirty just a few pages while we are doing
> > 	writeout, this looping could result in writing just a few pages
> > 	in each round which is bad for fragmentation etc.
> > 
> > However it requires really strong timing to make that happen
> > (continuously).  In practice it's very hard to produce such a pattern even if
> > there is such a possibility in theory. I actually tried to write 1 page
> > per 1ms with this command
> > 
> > 	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
> > 
> > and do sync(1) at the same time. The sync completes quickly on ext4,
> > xfs, btrfs. The readers could try other write-and-sleep patterns and
> > check if it can block sync for longer time.
>   After some thought I realized that i_dirtied_when is going to be updated
> in these cases and so we stop writing back the inode soon. So I think we
> should be fine in the end. You can add:

Right, now that we do tagged sync and update i_dirtied_when, the test
case is guaranteed to not livelock.
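
(To recap the mechanism from patch 01: write_cache_pages() first tags all
currently dirty pages with PAGECACHE_TAG_TOWRITE in one quick pass via
tag_pages_for_writeback(), then writes only the tagged pages, so pages
dirtied after that point cannot extend the current pass.)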

But note that the test in the changelog was carried out before the
newly added sync livelock patches. 

So I simplified the last changelog paragraphs as follows.

---
Subject: writeback: try more writeback as long as something was written
Date: Thu Jul 22 10:23:44 CST 2010

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate b_io with possibly a subset of the eligible inodes
at entrance time. When the queued set of inodes is all synced, they just
return, possibly with all queued inode pages written but still
wbc.nr_to_write > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes are completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.

For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over
background threshold. This will be a fairly normal/frequent case I guess.
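
For instance, with MAX_WRITEBACK_PAGES == 1024: if i90-i99 hold, say, 600
dirty pages in total, the batch completes with wbc.nr_to_write == 424 > 0,
and without this patch wb_writeback() would bail even though the bdi may
still be over the background threshold.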

Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1).  I actually tried to write 1 page
per 1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
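
A rough userspace equivalent of that workload, assuming -n is the
iteration count, -S the sleep time in microseconds and -c the write
chunk size, would be:

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		static char buf[4096];
		int fd = open("/fs/test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		int i;

		if (fd < 0)
			return 1;
		for (i = 0; i < 10000; i++) {
			write(fd, buf, sizeof(buf));	/* dirty one page */
			fsync(fd);			/* kick writeback */
			usleep(1000);			/* ~1 page per 1ms */
		}
		return close(fd);
	}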

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:24.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:25.000000000 +0800
@@ -739,23 +739,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * Dirty inodes are moved to b_io for writeback in batches.
+		 * The completion of the current batch does not necessarily
+		 * mean the overall work is done. So we keep looping as long
+		 * as we made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)
 			break;
 		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/17] writeback: sync expired inodes first in background writeback
  2011-05-09 16:18     ` Rik van Riel
@ 2011-05-10  2:45       ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-10  2:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jan Kara, Andrew Morton, LKML, Dave Chinner, Mel Gorman,
	Christoph Hellwig, linux-fsdevel

On Tue, May 10, 2011 at 12:18:19AM +0800, Rik van Riel wrote:
> On 05/09/2011 12:08 PM, Jan Kara wrote:
> 
> >    The age of pages in the LRU is not necessarily related to the
> > i_dirtied_when time stamp, so I'm not sure how much this will help after all
> 
> Not necessarily, but given that the inactive file list is a
> FIFO list, I expect there will be decent correlation.
> 
> > but it makes some sense from the data integrity point of view at least. You
> > can add:
> > Acked-by: Jan Kara<jack@suse.cz>
> 
> Good point on the data integrity.

Yeah, I added this good point to the changelog :)

Thanks,
Fengguang
---
Subject: writeback: sync expired inodes first in background writeback
Date: Wed Jul 21 20:11:53 CST 2010

A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

At each queue_io() time, first try enqueuing only newly expired inodes.
If there are zero expired inodes to work with, then relax the rule and
enqueue all dirty inodes.
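
The policy boils down to something like the following userspace model (a
sketch of the idea only; in the patch the fallback lives in wb_writeback()
and the actual queueing in queue_io()):

	#include <stdbool.h>
	#include <stddef.h>

	struct inode_stub {
		unsigned long dirtied_when;	/* jiffies-like stamp */
		bool queued;
	};

	/* Queue inodes dirtied before now - expire; if none qualify,
	 * relax the rule and take every dirty inode. */
	static size_t queue_for_io(struct inode_stub *v, size_t n,
				   unsigned long now, unsigned long expire)
	{
		bool expired_only = true;
		size_t queued = 0;
		size_t i;
	retry:
		for (i = 0; i < n; i++) {
			if (expired_only && now - v[i].dirtied_when < expire)
				continue;	/* not expired yet */
			v[i].queued = true;
			queued++;
		}
		if (!queued && expired_only) {
			expired_only = false;	/* fall back to all inodes */
			goto retry;
		}
		return queued;
	}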

It at least makes sense from the data integrity point of view.

This may also reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So
syncing older inodes first helps reduce the dirty pages reached by the
page reclaim code.

More background: as Mel put it, "it makes sense to write old pages first
to reduce the chances page reclaim is initiating IO."

Rik also presented the situation with a graph:

LRU head                                 [*] dirty page
[                          *              *      * *  *  * * * * * *]

Ideally, most dirty pages should lie close to the LRU tail instead of
LRU head. That requires the flusher thread to sync old/expired inodes
first (as there are obvious correlations between inode age and page
age), and to give fair opportunities to newly expired inodes rather
than sticking with some large eldest inodes (as larger inodes have
weaker correlations in the inode<=>page ages).

This patch helps the flusher to meet both the above requirements.

Side effects: it might reduce the batch size and hence reduce
inode_wb_list_lock hold time, but in turn make the cluster-by-partition
logic in the same function less effective at reducing disk seeks.

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-05 23:30:25.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-05 23:30:26.000000000 +0800
@@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
-		if (work->for_kupdate) {
+		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
 			wbc.older_than_this = &oldest_jif;
@@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.pages_skipped = 0;
 		wbc.inodes_cleaned = 0;
 
+retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
 			__writeback_inodes_sb(work->sb, wb, &wbc);
@@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
+		 * background writeback will start with expired inodes, and
+		 * if none is found, fall back to all inodes. This order helps
+		 * reduce the number of dirty pages reaching the end of the LRU
+		 * lists and causing trouble for page reclaim.
+		 */
+		if (work->for_background &&
+		    wbc.older_than_this &&
+		    list_empty(&wb->b_io) &&
+		    list_empty(&wb->b_more_io)) {
+			wbc.older_than_this = NULL;
+			goto retry;
+		}
+		/*
 		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 14/17] writeback: make writeback_control.nr_to_write straight
  2011-05-09 16:54   ` Jan Kara
@ 2011-05-10  3:19     ` Wu Fengguang
  2011-05-10 13:44       ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-05-10  3:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, Dave Chinner, Christoph Hellwig, linux-fsdevel

On Tue, May 10, 2011 at 12:54:58AM +0800, Jan Kara wrote:
> On Fri 06-05-11 11:08:35, Wu Fengguang wrote:
> > Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
> > and initialize the struct writeback_control there.
> > 
> > struct writeback_control is basically designed to control writeback of a
> > single file, but we keep abusing it for writing multiple files in
> > writeback_sb_inodes() and its callers.
> > 
> > It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
> > work->nr_pages starts to make sense, and instead of saving and restoring
> > pages_skipped in writeback_sb_inodes() it can always start with a clean
> > zero value.
> > 
> > It also makes a neat IO pattern change: large dirty files are now
> > written in the full 4MB writeback chunk size, rather than whatever
> > quota remained in wbc->nr_to_write.
> > 
> > Proposed-by: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ...
> > @@ -543,34 +588,40 @@ static int writeback_sb_inodes(struct su
> >  			requeue_io(inode, wb);
> >  			continue;
> >  		}
> > -
> >  		__iget(inode);
> > +		write_chunk = writeback_chunk_size(work);
> > +		wbc.nr_to_write = write_chunk;
> > +		wbc.pages_skipped = 0;
> > +
> > +		writeback_single_inode(inode, wb, &wbc);
> >  
> > -		pages_skipped = wbc->pages_skipped;
> > -		writeback_single_inode(inode, wb, wbc);
> > -		if (wbc->pages_skipped != pages_skipped) {
> > +		work->nr_pages -= write_chunk - wbc.nr_to_write;
> > +		wrote += write_chunk - wbc.nr_to_write;
> > +		if (wbc.pages_skipped) {
> >  			/*
> >  			 * writeback is not making progress due to locked
> >  			 * buffers.  Skip this inode for now.
> >  			 */
> >  			redirty_tail(inode, wb);
> > -		}
> > +		} else if (!(inode->i_state & I_DIRTY))
> > +			wrote++;
> >  		spin_unlock(&inode->i_lock);
> >  		spin_unlock(&wb->list_lock);
> >  		iput(inode);
> >  		cond_resched();
> >  		spin_lock(&wb->list_lock);
> > -		if (wbc->nr_to_write <= 0)
> > -			return 1;
> > +		if (wrote >= MAX_WRITEBACK_PAGES)
> > +			break;
>   This definitely deserves a comment (as well as a similar check in
> __writeback_inodes_wb()). I guess you bail out here so that we perform the
> background limit check and the for_kupdate/for_background livelock checks
> often enough.

OK, do you like this one?

                /*
                 * bail out to wb_writeback() often enough to check
                 * background threshold and other termination conditions.
                 */

> I'm undecided whether it's good to bail out like this. It's
> not necessary in some cases (like WB_SYNC_ALL or for_sync writeback) but
> OTOH moving the necessary checks here does not look ideal either...

Yeah, it's OK either way and it's good to have less overhead in
normal non-sync operations.

When syncing large files, "wrote" will actually be much larger than
"MAX_WRITEBACK_PAGES", since the sync write chunk is LONG_MAX. So the
extra check is not much overhead for sync.
 
> >  void writeback_inodes_wb(struct bdi_writeback *wb,
> > -		struct writeback_control *wbc)
> > +			 struct writeback_control *wbc)
> >  {
> > +	struct wb_writeback_work work = {
> > +		.nr_pages	= wbc->nr_to_write,
> > +		.sync_mode	= wbc->sync_mode,
> > +		.range_cyclic	= wbc->range_cyclic,
> > +	};
> > +
> >  	spin_lock(&wb->list_lock);
> >  	if (list_empty(&wb->b_io))
> > -		queue_io(wb, wbc->older_than_this);
> > -	__writeback_inodes_wb(wb, wbc);
> > +		queue_io(wb, NULL);
> > +	__writeback_inodes_wb(wb, &work);
> >  	spin_unlock(&wb->list_lock);
> > -}
>   Hmm, maybe we should just pass in the number of pages (similarly to
> writeback_inodes_sb_nr())? It would look like a cleaner interface than
> passing the whole writeback_control and then ignoring parts of it.

Good point!
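
The interface then becomes

	long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);

returning the number of pages written, so a caller like
balance_dirty_pages() can just pass its write chunk and look at the
return value (hence the mm/page-writeback.c hunk in the diffstat below).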

Thanks,
Fengguang
---
Subject: writeback: make writeback_control.nr_to_write straight 
Date: Wed May 04 19:54:37 CST 2011

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes() it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
quota remained in wbc->nr_to_write.
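
(With 4KB pages, MAX_WRITEBACK_PAGES == 1024 is exactly that 4MB chunk:
writeback_sb_inodes() now gives each inode a fresh nr_to_write of up to
1024 pages from writeback_chunk_size(), instead of whatever was left of
the previous inode's quota.)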

Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 
 fs/fs-writeback.c                |  189 +++++++++++++++--------------
 include/linux/writeback.h        |    6 
 include/trace/events/writeback.h |    6 
 mm/backing-dev.c                 |    9 -
 mm/page-writeback.c              |   14 --
 6 files changed, 107 insertions(+), 119 deletions(-)

change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
  to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_cleaned
- remove wbc_writeback_* trace events for now

--- linux-next.orig/fs/fs-writeback.c	2011-05-10 10:50:13.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-10 11:09:38.000000000 +0800
@@ -30,11 +30,21 @@
 #include "internal.h"
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024L
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_sync:1;
 	unsigned int for_kupdate:1;
@@ -463,7 +473,6 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
-			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -496,6 +505,31 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+	long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_sb_inodes()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
+		pages = LONG_MAX;
+	else
+		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+	return pages;
+}
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -503,18 +537,29 @@ static bool pin_sb_for_writeback(struct 
  * inodes. Otherwise write only ones which go sequentially
  * in reverse order.
  *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes cleaned.
  */
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
-		struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
 {
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_sync		= work->tagged_sync,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	long write_chunk;
+	long wrote = 0;  /* count both pages and inodes */
+
 	while (!list_empty(&wb->b_io)) {
-		long pages_skipped;
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
 		if (inode->i_sb != sb) {
-			if (only_this_sb) {
+			if (work->sb) {
 				/*
 				 * We only want to write back data for this
 				 * superblock, move all inodes not belonging
@@ -529,7 +574,7 @@ static int writeback_sb_inodes(struct su
 			 * Bounce back to the caller to unpin this and
 			 * pin the next superblock.
 			 */
-			return 0;
+			break;
 		}
 
 		/*
@@ -543,34 +588,44 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode, wb);
 			continue;
 		}
-
 		__iget(inode);
+		write_chunk = writeback_chunk_size(work);
+		wbc.nr_to_write = write_chunk;
+		wbc.pages_skipped = 0;
+
+		writeback_single_inode(inode, wb, &wbc);
 
-		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wb, wbc);
-		if (wbc->pages_skipped != pages_skipped) {
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
+		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
 			redirty_tail(inode, wb);
-		}
+		} else if (!(inode->i_state & I_DIRTY))
+			wrote++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0)
-			return 1;
+		/*
+		 * bail out to wb_writeback() often enough to check
+		 * background threshold and other termination conditions.
+		 */
+		if (wrote >= MAX_WRITEBACK_PAGES)
+			break;
+		if (work->nr_pages <= 0)
+			break;
 	}
-	/* b_io is empty */
-	return 1;
+	return wrote;
 }
 
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
-				  struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
-	int ret = 0;
+	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -580,33 +635,34 @@ static void __writeback_inodes_wb(struct
 			requeue_io(inode, wb);
 			continue;
 		}
-		ret = writeback_sb_inodes(sb, wb, wbc, false);
+		wrote += writeback_sb_inodes(sb, wb, work);
 		drop_super(sb);
 
-		if (ret)
+		if (wrote >= MAX_WRITEBACK_PAGES)
+			break;
+		if (work->nr_pages <= 0)
 			break;
 	}
 	/* Leave any unwritten inodes on b_io */
+	return wrote;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
 {
+	struct wb_writeback_work work = {
+		.nr_pages	= nr_pages,
+		.sync_mode	= WB_SYNC_NONE,
+		.range_cyclic	= 1,
+	};
+
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
-	__writeback_inodes_wb(wb, wbc);
+		queue_io(wb, NULL);
+	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
-}
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+	return nr_pages - work.nr_pages;
+}
 
 static inline bool over_bground_thresh(void)
 {
@@ -636,42 +692,13 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_sync		= work->tagged_sync,
-		.older_than_this	= NULL,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.range_cyclic		= work->range_cyclic,
-	};
+	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	long wrote = 0;
-	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
-
-	if (!wbc.range_cyclic) {
-		wbc.range_start = 0;
-		wbc.range_end = LLONG_MAX;
-	}
-
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          writeback_sb_inodes()       <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
-		write_chunk = LONG_MAX;
+	long progress;
 
 	oldest_jif = jiffies;
-	wbc.older_than_this = &oldest_jif;
+	work->older_than_this = &oldest_jif;
 
 	for (;;) {
 		/*
@@ -700,27 +727,18 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			wbc.older_than_this = &oldest_jif;
+			work->older_than_this = &oldest_jif;
 		}
 
-		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
-		wbc.inodes_cleaned = 0;
-
 retry:
-		trace_wbc_writeback_start(&wbc, wb->bdi);
 		spin_lock(&wb->list_lock);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, wbc.older_than_this);
+			queue_io(wb, work->older_than_this);
 		if (work->sb)
-			writeback_sb_inodes(work->sb, wb, &wbc, true);
+			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-			__writeback_inodes_wb(wb, &wbc);
+			progress = __writeback_inodes_wb(wb, work);
 		spin_unlock(&wb->list_lock);
-		trace_wbc_writeback_written(&wbc, wb->bdi);
-
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
 		 * Did we write something? Try for more
@@ -730,9 +748,7 @@ retry:
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		if (wbc.inodes_cleaned)
+		if (progress)
 			continue;
 		/*
 		 * background writeback will start with expired inodes, and
@@ -741,10 +757,10 @@ retry:
 		 * lists and cause trouble to the page reclaim.
 		 */
 		if (work->for_background &&
-		    wbc.older_than_this &&
+		    work->older_than_this &&
 		    list_empty(&wb->b_io) &&
 		    list_empty(&wb->b_more_io)) {
-			wbc.older_than_this = NULL;
+			work->older_than_this = NULL;
 			goto retry;
 		}
 		/*
@@ -760,7 +776,6 @@ retry:
 		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
-			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
@@ -768,7 +783,7 @@ retry:
 		spin_unlock(&wb->list_lock);
 	}
 
-	return wrote;
+	return nr_pages - work->nr_pages;
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-05-10 10:50:13.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-10 11:13:48.000000000 +0800
@@ -24,12 +24,9 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
-	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
@@ -58,8 +55,7 @@ void writeback_inodes_sb_nr(struct super
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
 
--- linux-next.orig/mm/backing-dev.c	2011-05-10 10:50:13.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-05-10 11:13:25.000000000 +0800
@@ -262,14 +262,7 @@ int bdi_has_dirty_io(struct backing_dev_
 
 static void bdi_flush_io(struct backing_dev_info *bdi)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
-		.range_cyclic		= 1,
-		.nr_to_write		= 1024,
-	};
-
-	writeback_inodes_wb(&bdi->wb, &wbc);
+	writeback_inodes_wb(&bdi->wb, 1024);
 }
 
 /*
--- linux-next.orig/mm/page-writeback.c	2011-05-10 10:50:13.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-10 11:11:32.000000000 +0800
@@ -494,13 +494,6 @@ static void balance_dirty_pages(struct a
 		return;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -562,15 +555,12 @@ static void balance_dirty_pages(struct a
 		 * threshold otherwise wait until the disk writes catch
 		 * up.
 		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
 		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
+			pages_written += writeback_inodes_wb(&bdi->wb,
+							     write_chunk);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 
--- linux-next.orig/include/trace/events/writeback.h	2011-05-10 10:50:13.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-10 10:50:13.000000000 +0800
@@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -132,7 +129,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/fs/btrfs/extent_io.c	2011-05-10 09:50:03.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-05-10 10:50:13.000000000 +0800
@@ -2596,7 +2596,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2629,7 +2628,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,

^ permalink raw reply	[flat|nested] 42+ messages in thread
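
A back-of-the-envelope view of what the per-inode write_chunk in the diff
above buys: previously a single wbc->nr_to_write quota was shared by all
inodes in a batch, so a later inode only got the leftover; now
writeback_chunk_size() hands each inode a fresh chunk. The sketch below is
plain user-space C for illustration only; MAX_WRITEBACK_PAGES is the
constant from the patch, while the 4KB page size and the 700-page inode A
are assumptions.

#include <stdio.h>

#define MAX_WRITEBACK_PAGES     1024L   /* as defined in the patch */
#define PAGE_SIZE_KB            4L      /* assumption: 4KB pages */

int main(void)
{
        /* old scheme: inode B only gets what inode A left over */
        long quota = MAX_WRITEBACK_PAGES;
        long inode_a = 700;             /* pages written for inode A */

        printf("old: inode B gets %ld pages (%ld KB)\n",
               quota - inode_a, (quota - inode_a) * PAGE_SIZE_KB);

        /* new scheme: every inode starts with a fresh 1024-page chunk,
         * so large dirty files are written in full 4MB chunks */
        printf("new: inode B gets %ld pages (%ld KB)\n",
               MAX_WRITEBACK_PAGES, MAX_WRITEBACK_PAGES * PAGE_SIZE_KB);
        return 0;
}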

* Re: [PATCH 02/17] writeback: update dirtied_when for synced inode to prevent livelock
  2011-05-10  2:14     ` Wu Fengguang
@ 2011-05-10 12:05       ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2011-05-10 12:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Tue 10-05-11 10:14:21, Wu Fengguang wrote:
> On Sat, May 07, 2011 at 12:33:12AM +0800, Jan Kara wrote:
> > On Fri 06-05-11 11:08:23, Wu Fengguang wrote:
> > > With the more aggressive "keep writeback as long as we wrote something"
> > > logic in wb_writeback(), the "use LONG_MAX .nr_to_write" trick in commit
> > > b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL writeback") is
> > > no longer enough to stop sync livelock.
> > > 
> > > The fix is to explicitly update .dirtied_when on synced inodes, so that
> > > they are no longer considered for writeback in the next round.
> >   The changelog doesn't make sense now that the patch is in second
> > place in the series... Also, as we discussed in the previous iteration
> > of the patches, I thought you'd move the condition before the
> > mapping_tagged() test. I.e. the code would be (comment updated):
> > 	if (!(inode->i_state & I_FREEING)) {
> > 		/*
> > 		 * Sync livelock prevention: Each inode is tagged and
> > 		 * synced in one shot. So we can unconditionally update its
> > 		 * dirty time to prevent syncing it again. Note that time
> > 		 * ordering of b_dirty list will be kept because the
> > 		 * following code either removes the inode from b_dirty
> > 		 * or calls redirty_tail().
> > 		 */
> > 		if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
> > 			inode->dirtied_when = jiffies;
> > 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> 
> It looks good, thanks!  Here is the updated patch with your
> reviewed-by. Note that I retained a modified paragraph in the changelog
> to clarify its relationship to the other two patches.
  Thanks. It's fine now.

								Honza

> Thanks,
> Fengguang
> ---
> Subject: writeback: update dirtied_when for synced inode to prevent livelock
> Date: Wed Apr 27 19:05:21 CST 2011
> 
> Explicitly update .dirtied_when on synced inodes, so that they are no
> longer considered for writeback in the next round.
> 
> We'll apply more aggressive "keep writeback as long as we wrote something"
> logic in wb_writeback(). Then the "use LONG_MAX .nr_to_write" trick in
> commit b9543dac5b ("writeback: avoid livelocking WB_SYNC_ALL writeback")
> will no longer be enough to stop sync livelock.
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-10 09:50:07.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-10 10:03:00.000000000 +0800
> @@ -419,6 +419,15 @@ writeback_single_inode(struct inode *ino
>  	spin_lock(&inode->i_lock);
>  	inode->i_state &= ~I_SYNC;
>  	if (!(inode->i_state & I_FREEING)) {
> +		/*
> +		 * Sync livelock prevention. Each inode is tagged and synced in
> +		 * one shot, so we can unconditionally update its dirty time to
> +		 * prevent syncing it again. Note that time ordering of b_dirty
> +		 * list will be kept because the following code either removes
> +		 * the inode from b_dirty or calls redirty_tail().
> +		 */
> +		if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
> +			inode->dirtied_when = jiffies;
>  		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
>  			 * We didn't write back all the pages.  nfs_writepages()
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread
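
The workload this guards against is easy to picture: one task keeps
redirtying the same inode while another calls sync(2). Below is a minimal
user-space sketch of such a workload, illustrative only (the filename is
made up and this is not a guaranteed reproducer); with the dirtied_when
update above, the sync child should finish in bounded time instead of
chasing the writer.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[4096] = { 0 };
        int fd = open("dirty.dat", O_CREAT | O_WRONLY, 0644);
        long i;

        if (fd < 0)
                return 1;
        if (fork() == 0) {
                sync();         /* WB_SYNC_NONE stage, then WB_SYNC_ALL */
                _exit(0);
        }
        for (i = 0; i < 1000000; i++)   /* writer: keep the inode dirty */
                pwrite(fd, buf, sizeof(buf), 0);
        return 0;
}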

* Re: [PATCH 14/17] writeback: make writeback_control.nr_to_write straight
  2011-05-10  3:19     ` Wu Fengguang
@ 2011-05-10 13:44       ` Jan Kara
  2011-05-11 14:38         ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2011-05-10 13:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Tue 10-05-11 11:19:39, Wu Fengguang wrote:
> On Tue, May 10, 2011 at 12:54:58AM +0800, Jan Kara wrote:
> > On Fri 06-05-11 11:08:35, Wu Fengguang wrote:
> > > Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
> > > and initialize the struct writeback_control there.
> > > 
> > > struct writeback_control is basically designed to control writeback of a
> > > single file, but we keep abusing it for writing multiple files in
> > > writeback_sb_inodes() and its callers.
> > > 
> > > It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
> > > work->nr_pages starts to make sense, and instead of saving and restoring
> > > pages_skipped in writeback_sb_inodes it can always start with a clean
> > > zero value.
> > > 
> > > It also makes a neat IO pattern change: large dirty files are now
> > > written in the full 4MB writeback chunk size, rather than whatever
> > > quota remained in wbc->nr_to_write.
> > > 
> > > Proposed-by: Christoph Hellwig <hch@infradead.org>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ...
> > > @@ -543,34 +588,40 @@ static int writeback_sb_inodes(struct su
> > >  			requeue_io(inode, wb);
> > >  			continue;
> > >  		}
> > > -
> > >  		__iget(inode);
> > > +		write_chunk = writeback_chunk_size(work);
> > > +		wbc.nr_to_write = write_chunk;
> > > +		wbc.pages_skipped = 0;
> > > +
> > > +		writeback_single_inode(inode, wb, &wbc);
> > >  
> > > -		pages_skipped = wbc->pages_skipped;
> > > -		writeback_single_inode(inode, wb, wbc);
> > > -		if (wbc->pages_skipped != pages_skipped) {
> > > +		work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > +		wrote += write_chunk - wbc.nr_to_write;
> > > +		if (wbc.pages_skipped) {
> > >  			/*
> > >  			 * writeback is not making progress due to locked
> > >  			 * buffers.  Skip this inode for now.
> > >  			 */
> > >  			redirty_tail(inode, wb);
> > > -		}
> > > +		} else if (!(inode->i_state & I_DIRTY))
> > > +			wrote++;
> > >  		spin_unlock(&inode->i_lock);
> > >  		spin_unlock(&wb->list_lock);
> > >  		iput(inode);
> > >  		cond_resched();
> > >  		spin_lock(&wb->list_lock);
> > > -		if (wbc->nr_to_write <= 0)
> > > -			return 1;
> > > +		if (wrote >= MAX_WRITEBACK_PAGES)
> > > +			break;
> >   This definitely deserves a comment (as well as a similar check in
> > __writeback_inodes_wb()). I guess you bail out here so that we perform the
> > background limit check and the for_kupdate/for_background livelock check
> > often enough.
> 
> OK, do you like this one?
> 
>                 /*
>                  * bail out to wb_writeback() often enough to check
>                  * background threshold and other termination conditions.
>                  */
  Yes, this looks OK.

> ---
> Subject: writeback: make writeback_control.nr_to_write straight 
> Date: Wed May 04 19:54:37 CST 2011
> 
> Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
> and initialize the struct writeback_control there.
> 
> struct writeback_control is basically designed to control writeback of a
> single file, but we keep abusing it for writing multiple files in
> writeback_sb_inodes() and its callers.
> 
> It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
> work->nr_pages starts to make sense, and instead of saving and restoring
> pages_skipped in writeback_sb_inodes it can always start with a clean
> zero value.
> 
> It also makes a neat IO pattern change: large dirty files are now
> written in the full 4MB writeback chunk size, rather than whatever
> quota remained in wbc->nr_to_write.
  One more thing that I've noticed while looking through this patch: You
are removing several trace points (without actually removing the trace
point definitions BTW):
trace_wbc_writeback_start(&wbc, wb->bdi);
trace_wbc_writeback_written(&wbc, wb->bdi);
trace_wbc_writeback_wait(&wbc, wb->bdi);
trace_wbc_balance_dirty_start(&wbc, bdi);
trace_wbc_balance_dirty_written(&wbc, bdi);
trace_wbc_balance_dirty_wait(&wbc, bdi);

I guess they should be replaced with equivalent trace points that take
struct wb_writeback_work as an argument instead of just removing them?

								Honza

> Proposed-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/btrfs/extent_io.c             |    2 
>  fs/fs-writeback.c                |  189 +++++++++++++++--------------
>  include/linux/writeback.h        |    6 
>  include/trace/events/writeback.h |    6 
>  mm/backing-dev.c                 |    9 -
>  mm/page-writeback.c              |   14 --
>  6 files changed, 107 insertions(+), 119 deletions(-)
> 
> change set:
> - move writeback_control from wb_writeback() down to writeback_sb_inodes()
> - change return value of writeback_sb_inodes()/__writeback_inodes_wb()
>   to the number of pages and/or inodes written
> - move writeback_control.older_than_this to struct wb_writeback_work
> - remove writeback_control.inodes_cleaned
> - remove wbc_writeback_* trace events for now
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-10 10:50:13.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-10 11:09:38.000000000 +0800
> @@ -30,11 +30,21 @@
>  #include "internal.h"
>  
>  /*
> + * The maximum number of pages to writeout in a single bdi flush/kupdate
> + * operation.  We do this so we don't hold I_SYNC against an inode for
> + * enormous amounts of time, which would block a userspace task which has
> + * been forced to throttle against that inode.  Also, the code reevaluates
> + * the dirty each time it has written this many pages.
> + */
> +#define MAX_WRITEBACK_PAGES     1024L
> +
> +/*
>   * Passed into wb_writeback(), essentially a subset of writeback_control
>   */
>  struct wb_writeback_work {
>  	long nr_pages;
>  	struct super_block *sb;
> +	unsigned long *older_than_this;
>  	enum writeback_sync_modes sync_mode;
>  	unsigned int tagged_sync:1;
>  	unsigned int for_kupdate:1;
> @@ -463,7 +473,6 @@ writeback_single_inode(struct inode *ino
>  			 * No need to add it back to the LRU.
>  			 */
>  			list_del_init(&inode->i_wb_list);
> -			wbc->inodes_cleaned++;
>  		}
>  	}
>  	inode_sync_complete(inode);
> @@ -496,6 +505,31 @@ static bool pin_sb_for_writeback(struct 
>  	return false;
>  }
>  
> +static long writeback_chunk_size(struct wb_writeback_work *work)
> +{
> +	long pages;
> +
> +	/*
> +	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
> +	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
> +	 * here avoids calling into writeback_inodes_wb() more than once.
> +	 *
> +	 * The intended call sequence for WB_SYNC_ALL writeback is:
> +	 *
> +	 *      wb_writeback()
> +	 *          writeback_sb_inodes()       <== called only once
> +	 *              write_cache_pages()     <== called once for each inode
> +	 *                   (quickly) tag currently dirty pages
> +	 *                   (maybe slowly) sync all tagged pages
> +	 */
> +	if (work->sync_mode == WB_SYNC_ALL || work->tagged_sync)
> +		pages = LONG_MAX;
> +	else
> +		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
> +
> +	return pages;
> +}
> +
>  /*
>   * Write a portion of b_io inodes which belong to @sb.
>   *
> @@ -503,18 +537,29 @@ static bool pin_sb_for_writeback(struct 
>   * inodes. Otherwise write only ones which go sequentially
>   * in reverse order.
>   *
> - * Return 1, if the caller writeback routine should be
> - * interrupted. Otherwise return 0.
> + * Return the number of pages and/or inodes cleaned.
>   */
> -static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
> -		struct writeback_control *wbc, bool only_this_sb)
> +static long writeback_sb_inodes(struct super_block *sb,
> +				struct bdi_writeback *wb,
> +				struct wb_writeback_work *work)
>  {
> +	struct writeback_control wbc = {
> +		.sync_mode		= work->sync_mode,
> +		.tagged_sync		= work->tagged_sync,
> +		.for_kupdate		= work->for_kupdate,
> +		.for_background		= work->for_background,
> +		.range_cyclic		= work->range_cyclic,
> +		.range_start		= 0,
> +		.range_end		= LLONG_MAX,
> +	};
> +	long write_chunk;
> +	long wrote = 0;  /* count both pages and inodes */
> +
>  	while (!list_empty(&wb->b_io)) {
> -		long pages_skipped;
>  		struct inode *inode = wb_inode(wb->b_io.prev);
>  
>  		if (inode->i_sb != sb) {
> -			if (only_this_sb) {
> +			if (work->sb) {
>  				/*
>  				 * We only want to write back data for this
>  				 * superblock, move all inodes not belonging
> @@ -529,7 +574,7 @@ static int writeback_sb_inodes(struct su
>  			 * Bounce back to the caller to unpin this and
>  			 * pin the next superblock.
>  			 */
> -			return 0;
> +			break;
>  		}
>  
>  		/*
> @@ -543,34 +588,44 @@ static int writeback_sb_inodes(struct su
>  			requeue_io(inode, wb);
>  			continue;
>  		}
> -
>  		__iget(inode);
> +		write_chunk = writeback_chunk_size(work);
> +		wbc.nr_to_write = write_chunk;
> +		wbc.pages_skipped = 0;
> +
> +		writeback_single_inode(inode, wb, &wbc);
>  
> -		pages_skipped = wbc->pages_skipped;
> -		writeback_single_inode(inode, wb, wbc);
> -		if (wbc->pages_skipped != pages_skipped) {
> +		work->nr_pages -= write_chunk - wbc.nr_to_write;
> +		wrote += write_chunk - wbc.nr_to_write;
> +		if (wbc.pages_skipped) {
>  			/*
>  			 * writeback is not making progress due to locked
>  			 * buffers.  Skip this inode for now.
>  			 */
>  			redirty_tail(inode, wb);
> -		}
> +		} else if (!(inode->i_state & I_DIRTY))
> +			wrote++;
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&wb->list_lock);
> -		if (wbc->nr_to_write <= 0)
> -			return 1;
> +		/*
> +		 * bail out to wb_writeback() often enough to check
> +		 * background threshold and other termination conditions.
> +		 */
> +		if (wrote >= MAX_WRITEBACK_PAGES)
> +			break;
> +		if (work->nr_pages <= 0)
> +			break;
>  	}
> -	/* b_io is empty */
> -	return 1;
> +	return wrote;
>  }
>  
> -static void __writeback_inodes_wb(struct bdi_writeback *wb,
> -				  struct writeback_control *wbc)
> +static long __writeback_inodes_wb(struct bdi_writeback *wb,
> +				  struct wb_writeback_work *work)
>  {
> -	int ret = 0;
> +	long wrote = 0;
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -580,33 +635,34 @@ static void __writeback_inodes_wb(struct
>  			requeue_io(inode, wb);
>  			continue;
>  		}
> -		ret = writeback_sb_inodes(sb, wb, wbc, false);
> +		wrote += writeback_sb_inodes(sb, wb, work);
>  		drop_super(sb);
>  
> -		if (ret)
> +		if (wrote >= MAX_WRITEBACK_PAGES)
> +			break;
> +		if (work->nr_pages <= 0)
>  			break;
>  	}
>  	/* Leave any unwritten inodes on b_io */
> +	return wrote;
>  }
>  
> -void writeback_inodes_wb(struct bdi_writeback *wb,
> -		struct writeback_control *wbc)
> +long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
>  {
> +	struct wb_writeback_work work = {
> +		.nr_pages	= nr_pages,
> +		.sync_mode	= WB_SYNC_NONE,
> +		.range_cyclic	= 1,
> +	};
> +
>  	spin_lock(&wb->list_lock);
>  	if (list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
> -	__writeback_inodes_wb(wb, wbc);
> +		queue_io(wb, NULL);
> +	__writeback_inodes_wb(wb, &work);
>  	spin_unlock(&wb->list_lock);
> -}
>  
> -/*
> - * The maximum number of pages to writeout in a single bdi flush/kupdate
> - * operation.  We do this so we don't hold I_SYNC against an inode for
> - * enormous amounts of time, which would block a userspace task which has
> - * been forced to throttle against that inode.  Also, the code reevaluates
> - * the dirty each time it has written this many pages.
> - */
> -#define MAX_WRITEBACK_PAGES     1024
> +	return nr_pages - work.nr_pages;
> +}
>  
>  static inline bool over_bground_thresh(void)
>  {
> @@ -636,42 +692,13 @@ static inline bool over_bground_thresh(v
>  static long wb_writeback(struct bdi_writeback *wb,
>  			 struct wb_writeback_work *work)
>  {
> -	struct writeback_control wbc = {
> -		.sync_mode		= work->sync_mode,
> -		.tagged_sync		= work->tagged_sync,
> -		.older_than_this	= NULL,
> -		.for_kupdate		= work->for_kupdate,
> -		.for_background		= work->for_background,
> -		.range_cyclic		= work->range_cyclic,
> -	};
> +	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
> -	long wrote = 0;
> -	long write_chunk = MAX_WRITEBACK_PAGES;
>  	struct inode *inode;
> -
> -	if (!wbc.range_cyclic) {
> -		wbc.range_start = 0;
> -		wbc.range_end = LLONG_MAX;
> -	}
> -
> -	/*
> -	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
> -	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
> -	 * here avoids calling into writeback_inodes_wb() more than once.
> -	 *
> -	 * The intended call sequence for WB_SYNC_ALL writeback is:
> -	 *
> -	 *      wb_writeback()
> -	 *          writeback_sb_inodes()       <== called only once
> -	 *              write_cache_pages()     <== called once for each inode
> -	 *                   (quickly) tag currently dirty pages
> -	 *                   (maybe slowly) sync all tagged pages
> -	 */
> -	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
> -		write_chunk = LONG_MAX;
> +	long progress;
>  
>  	oldest_jif = jiffies;
> -	wbc.older_than_this = &oldest_jif;
> +	work->older_than_this = &oldest_jif;
>  
>  	for (;;) {
>  		/*
> @@ -700,27 +727,18 @@ static long wb_writeback(struct bdi_writ
>  		if (work->for_kupdate || work->for_background) {
>  			oldest_jif = jiffies -
>  				msecs_to_jiffies(dirty_expire_interval * 10);
> -			wbc.older_than_this = &oldest_jif;
> +			work->older_than_this = &oldest_jif;
>  		}
>  
> -		wbc.nr_to_write = write_chunk;
> -		wbc.pages_skipped = 0;
> -		wbc.inodes_cleaned = 0;
> -
>  retry:
> -		trace_wbc_writeback_start(&wbc, wb->bdi);
>  		spin_lock(&wb->list_lock);
>  		if (list_empty(&wb->b_io))
> -			queue_io(wb, wbc.older_than_this);
> +			queue_io(wb, work->older_than_this);
>  		if (work->sb)
> -			writeback_sb_inodes(work->sb, wb, &wbc, true);
> +			progress = writeback_sb_inodes(work->sb, wb, work);
>  		else
> -			__writeback_inodes_wb(wb, &wbc);
> +			progress = __writeback_inodes_wb(wb, work);
>  		spin_unlock(&wb->list_lock);
> -		trace_wbc_writeback_written(&wbc, wb->bdi);
> -
> -		work->nr_pages -= write_chunk - wbc.nr_to_write;
> -		wrote += write_chunk - wbc.nr_to_write;
>  
>  		/*
>  		 * Did we write something? Try for more
> @@ -730,9 +748,7 @@ retry:
>  		 * mean the overall work is done. So we keep looping as long
>  		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (wbc.nr_to_write < write_chunk)
> -			continue;
> -		if (wbc.inodes_cleaned)
> +		if (progress)
>  			continue;
>  		/*
>  		 * background writeback will start with expired inodes, and
> @@ -741,10 +757,10 @@ retry:
>  		 * lists and cause trouble to the page reclaim.
>  		 */
>  		if (work->for_background &&
> -		    wbc.older_than_this &&
> +		    work->older_than_this &&
>  		    list_empty(&wb->b_io) &&
>  		    list_empty(&wb->b_more_io)) {
> -			wbc.older_than_this = NULL;
> +			work->older_than_this = NULL;
>  			goto retry;
>  		}
>  		/*
> @@ -760,7 +776,6 @@ retry:
>  		spin_lock(&wb->list_lock);
>  		if (!list_empty(&wb->b_more_io))  {
>  			inode = wb_inode(wb->b_more_io.prev);
> -			trace_wbc_writeback_wait(&wbc, wb->bdi);
>  			spin_lock(&inode->i_lock);
>  			inode_wait_for_writeback(inode, wb);
>  			spin_unlock(&inode->i_lock);
> @@ -768,7 +783,7 @@ retry:
>  		spin_unlock(&wb->list_lock);
>  	}
>  
> -	return wrote;
> +	return nr_pages - work->nr_pages;
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-05-10 10:50:13.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-05-10 11:13:48.000000000 +0800
> @@ -24,12 +24,9 @@ enum writeback_sync_modes {
>   */
>  struct writeback_control {
>  	enum writeback_sync_modes sync_mode;
> -	unsigned long *older_than_this;	/* If !NULL, only write back inodes
> -					   older than this */
>  	long nr_to_write;		/* Write this many pages, and decrement
>  					   this for each page written */
>  	long pages_skipped;		/* Pages which were not written */
> -	long inodes_cleaned;		/* # of inodes cleaned */
>  
>  	/*
>  	 * For a_ops->writepages(): is start or end are non-zero then this is
> @@ -58,8 +55,7 @@ void writeback_inodes_sb_nr(struct super
>  int writeback_inodes_sb_if_idle(struct super_block *);
>  int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
>  void sync_inodes_sb(struct super_block *);
> -void writeback_inodes_wb(struct bdi_writeback *wb,
> -		struct writeback_control *wbc);
> +long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
>  long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
>  void wakeup_flusher_threads(long nr_pages);
>  
> --- linux-next.orig/mm/backing-dev.c	2011-05-10 10:50:13.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-05-10 11:13:25.000000000 +0800
> @@ -262,14 +262,7 @@ int bdi_has_dirty_io(struct backing_dev_
>  
>  static void bdi_flush_io(struct backing_dev_info *bdi)
>  {
> -	struct writeback_control wbc = {
> -		.sync_mode		= WB_SYNC_NONE,
> -		.older_than_this	= NULL,
> -		.range_cyclic		= 1,
> -		.nr_to_write		= 1024,
> -	};
> -
> -	writeback_inodes_wb(&bdi->wb, &wbc);
> +	writeback_inodes_wb(&bdi->wb, 1024);
>  }
>  
>  /*
> --- linux-next.orig/mm/page-writeback.c	2011-05-10 10:50:13.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-05-10 11:11:32.000000000 +0800
> @@ -494,13 +494,6 @@ static void balance_dirty_pages(struct a
>  		return;
>  
>  	for (;;) {
> -		struct writeback_control wbc = {
> -			.sync_mode	= WB_SYNC_NONE,
> -			.older_than_this = NULL,
> -			.nr_to_write	= write_chunk,
> -			.range_cyclic	= 1,
> -		};
> -
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
>  		nr_writeback = global_page_state(NR_WRITEBACK);
> @@ -562,15 +555,12 @@ static void balance_dirty_pages(struct a
>  		 * threshold otherwise wait until the disk writes catch
>  		 * up.
>  		 */
> -		trace_wbc_balance_dirty_start(&wbc, bdi);
>  		if (bdi_nr_reclaimable > bdi_thresh) {
> -			writeback_inodes_wb(&bdi->wb, &wbc);
> -			pages_written += write_chunk - wbc.nr_to_write;
> -			trace_wbc_balance_dirty_written(&wbc, bdi);
> +			pages_written += writeback_inodes_wb(&bdi->wb,
> +							     write_chunk);
>  			if (pages_written >= write_chunk)
>  				break;		/* We've done our duty */
>  		}
> -		trace_wbc_balance_dirty_wait(&wbc, bdi);
>  		__set_current_state(TASK_UNINTERRUPTIBLE);
>  		io_schedule_timeout(pause);
>  
> --- linux-next.orig/include/trace/events/writeback.h	2011-05-10 10:50:13.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2011-05-10 10:50:13.000000000 +0800
> @@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__field(int, for_background)
>  		__field(int, for_reclaim)
>  		__field(int, range_cyclic)
> -		__field(unsigned long, older_than_this)
>  		__field(long, range_start)
>  		__field(long, range_end)
>  	),
> @@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background	= wbc->for_background;
>  		__entry->for_reclaim	= wbc->for_reclaim;
>  		__entry->range_cyclic	= wbc->range_cyclic;
> -		__entry->older_than_this = wbc->older_than_this ?
> -						*wbc->older_than_this : 0;
>  		__entry->range_start	= (long)wbc->range_start;
>  		__entry->range_end	= (long)wbc->range_end;
>  	),
>  
>  	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> -		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
> +		"bgrd=%d reclm=%d cyclic=%d "
>  		"start=0x%lx end=0x%lx",
>  		__entry->name,
>  		__entry->nr_to_write,
> @@ -132,7 +129,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background,
>  		__entry->for_reclaim,
>  		__entry->range_cyclic,
> -		__entry->older_than_this,
>  		__entry->range_start,
>  		__entry->range_end)
>  )
> --- linux-next.orig/fs/btrfs/extent_io.c	2011-05-10 09:50:03.000000000 +0800
> +++ linux-next/fs/btrfs/extent_io.c	2011-05-10 10:50:13.000000000 +0800
> @@ -2596,7 +2596,6 @@ int extent_write_full_page(struct extent
>  	};
>  	struct writeback_control wbc_writepages = {
>  		.sync_mode	= wbc->sync_mode,
> -		.older_than_this = NULL,
>  		.nr_to_write	= 64,
>  		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
>  		.range_end	= (loff_t)-1,
> @@ -2629,7 +2628,6 @@ int extent_write_locked_range(struct ext
>  	};
>  	struct writeback_control wbc_writepages = {
>  		.sync_mode	= mode,
> -		.older_than_this = NULL,
>  		.nr_to_write	= nr_pages * 2,
>  		.range_start	= start,
>  		.range_end	= end + 1,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread
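
Stripped of locking and queueing, the accounting the review above converges
on can be condensed into a self-contained toy model in plain C. Everything
here is made up for illustration (writeback_batch(), write_one_inode(), the
700-page fake inodes); only MAX_WRITEBACK_PAGES and the nr_pages/wrote
bookkeeping mirror the patch.

#include <stdio.h>

#define MAX_WRITEBACK_PAGES     1024L

struct work { long nr_pages; };

/* stand-in for writeback_single_inode(): every fake inode carries
 * 700 dirty pages and always comes clean within its chunk */
static long write_one_inode(long chunk)
{
        return chunk < 700 ? chunk : 700;
}

static long writeback_batch(struct work *work)
{
        long wrote = 0;         /* counts both pages and inodes */

        while (work->nr_pages > 0) {
                /* per-inode chunk, as in writeback_chunk_size() for
                 * the WB_SYNC_NONE case */
                long chunk = work->nr_pages < MAX_WRITEBACK_PAGES ?
                             work->nr_pages : MAX_WRITEBACK_PAGES;
                long done = write_one_inode(chunk);

                work->nr_pages -= done;
                wrote += done;
                wrote++;        /* the inode came clean: progress too */
                /* bail out often enough for the caller to recheck
                 * background threshold and termination conditions */
                if (wrote >= MAX_WRITEBACK_PAGES)
                        break;
        }
        return wrote;
}

int main(void)
{
        struct work work = { .nr_pages = 4096 };
        long progress;

        while ((progress = writeback_batch(&work)) > 0)
                printf("progress %ld, %ld pages left\n",
                       progress, work.nr_pages);
        return 0;
}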

* Re: [PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned
  2011-05-10  2:23     ` Wu Fengguang
@ 2011-05-10 13:52       ` Jan Kara
  2011-05-10 15:00         ` Wu Fengguang
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2011-05-10 13:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, Mel Gorman, Dave Chinner,
	Christoph Hellwig, linux-fsdevel

On Tue 10-05-11 10:23:36, Wu Fengguang wrote:
> On Fri, May 06, 2011 at 10:36:14PM +0800, Jan Kara wrote:
> > On Fri 06-05-11 11:08:24, Wu Fengguang wrote:
> > > The flusher works on dirty inodes in batches, and may quit prematurely
> > > if the batch of inodes happens to be metadata-only dirtied: in this case
> > > wbc->nr_to_write won't be decreased at all, which stands for "no pages
> > > written" but is also mis-interpreted as "no progress".
> > > 
> > > So introduce writeback_control.inodes_cleaned to count the inodes that
> > > get cleaned.  A non-zero value means there is some progress on writeback,
> > > in which case more writeback can be tried.
> > > 
> > > about v1: The initial version was to count successful ->write_inode()
> > > calls.  However it leads to busy loops for sync() over NFS, because NFS
> > > ridiculously returns 0 (success) while at the same time redirtying the
> > > inode.  The NFS case can be trivially fixed, however there may be more
> > > hidden bugs in other filesystems..
> >   I think Christoph didn't like this patch
> > (https://lkml.org/lkml/2011/5/4/123) and suggested that inodes_cleaned
> > should remain local to fs-writeback.c...
> 
> Yes, he didn't like introducing one more field to writeback_control,
> which will be removed in patch 14. (It will be a lot of fuss to move
> patch 14 here or move this logic after patch 14.)
  Yes, I've noticed (after I sent this email) that you eventually remove
the code later in the series :). So I wonder, would it be that much work to
just drop this patch? It's kind of dumb to add new code which is removed
a few patches later. But I'm stopping bitching about this now ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread
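
For reference, the metadata-only dirty inodes the changelog above worries
about are trivial to produce from user space, for example by bumping an
inode's timestamps: nothing in the page cache gets dirtied, so writing the
inode back decrements no page counters. A sketch (the filename is made up):

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
        /* utimensat() with times == NULL sets atime/mtime to "now",
         * dirtying only the inode (I_DIRTY_SYNC), not its pages */
        if (utimensat(AT_FDCWD, "some_file", NULL, 0) != 0) {
                perror("utimensat");
                return 1;
        }
        return 0;
}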

* Re: [PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned
  2011-05-10 13:52       ` Jan Kara
@ 2011-05-10 15:00         ` Wu Fengguang
  0 siblings, 0 replies; 42+ messages in thread
From: Wu Fengguang @ 2011-05-10 15:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, Mel Gorman, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Tue, May 10, 2011 at 09:52:00PM +0800, Jan Kara wrote:
> On Tue 10-05-11 10:23:36, Wu Fengguang wrote:
> > On Fri, May 06, 2011 at 10:36:14PM +0800, Jan Kara wrote:
> > > On Fri 06-05-11 11:08:24, Wu Fengguang wrote:
> > > > The flusher works on dirty inodes in batches, and may quit prematurely
> > > > if the batch of inodes happens to be metadata-only dirtied: in this case
> > > > wbc->nr_to_write won't be decreased at all, which stands for "no pages
> > > > written" but is also mis-interpreted as "no progress".
> > > > 
> > > > So introduce writeback_control.inodes_cleaned to count the inodes that
> > > > get cleaned.  A non-zero value means there is some progress on writeback,
> > > > in which case more writeback can be tried.
> > > > 
> > > > about v1: The initial version was to count successful ->write_inode()
> > > > calls.  However it leads to busy loops for sync() over NFS, because NFS
> > > > ridiculously returns 0 (success) while at the same time redirtying the
> > > > inode.  The NFS case can be trivially fixed, however there may be more
> > > > hidden bugs in other filesystems..
> > >   I think Christoph didn't like this patch
> > > (https://lkml.org/lkml/2011/5/4/123) and suggested that inodes_cleaned
> > > should remain local to fs-writeback.c...
> > 
> > Yes, he didn't like introducing one more field to writeback_control,
> > which will be removed in patch 14. (It will be a lot of fuss to move
> > patch 14 here or move this logic after patch 14.)
>   Yes, I've noticed (after I sent this email) that you eventually remove
> the code later in the series :). So I wonder, would it be that much work to
> just drop this patch? It's kind of dumb to add new code which is removed
> a few patches later. But I'm stopping bitching about this now ;).

OK, thanks. I'm just uncomfortable that something will be broken in
between some commits if we simply drop this patch :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 14/17] writeback: make writeback_control.nr_to_write straight
  2011-05-10 13:44       ` Jan Kara
@ 2011-05-11 14:38         ` Wu Fengguang
  2011-05-11 14:54           ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Wu Fengguang @ 2011-05-11 14:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, Dave Chinner, Christoph Hellwig, linux-fsdevel

>   One more thing that I've noticed while looking through this patch: You
> are removing several trace points (without actually removing the trace
> point definitions BTW):
> trace_wbc_writeback_start(&wbc, wb->bdi);
> trace_wbc_writeback_written(&wbc, wb->bdi);
> trace_wbc_writeback_wait(&wbc, wb->bdi);
> trace_wbc_balance_dirty_start(&wbc, bdi);
> trace_wbc_balance_dirty_written(&wbc, bdi);
> trace_wbc_balance_dirty_wait(&wbc, bdi);
> 
> I guess they should be replaced with equivalent trace points that take
> struct wb_writeback_work as an argument instead of just removing them?

OK. How about this change?

---
 fs/fs-writeback.c                |    3 ++
 include/trace/events/writeback.h |   33 +++++++++++++++++++++++------
 mm/page-writeback.c              |    3 ++
 3 files changed, 33 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-11 22:19:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-11 22:20:50.000000000 +0800
@@ -731,6 +731,7 @@ static long wb_writeback(struct bdi_writ
 		}
 
 retry:
+		trace_writeback_start(wb->bdi, work);
 		spin_lock(&wb->list_lock);
 		if (list_empty(&wb->b_io))
 			queue_io(wb, work->older_than_this);
@@ -739,6 +740,7 @@ retry:
 		else
 			progress = __writeback_inodes_wb(wb, work);
 		spin_unlock(&wb->list_lock);
+		trace_writeback_written(wb->bdi, work);
 
 		/*
 		 * Did we write something? Try for more
@@ -775,6 +777,7 @@ retry:
 		 */
 		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
+			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
--- linux-next.orig/mm/page-writeback.c	2011-05-11 22:19:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-11 22:20:50.000000000 +0800
@@ -555,14 +555,17 @@ static void balance_dirty_pages(struct a
 		 * threshold otherwise wait until the disk writes catch
 		 * up.
 		 */
+		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > bdi_thresh) {
 			pages_written += writeback_inodes_wb(&bdi->wb,
 							     write_chunk);
+			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		trace_balance_dirty_wait(bdi);
 
 		/*
 		 * Increase the delay for each loop, up to our previous
--- linux-next.orig/include/trace/events/writeback.h	2011-05-11 22:19:22.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-11 22:22:20.000000000 +0800
@@ -49,6 +49,9 @@ DEFINE_EVENT(writeback_work_class, name,
 DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
@@ -88,6 +91,30 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
+DEFINE_WRITEBACK_EVENT(balance_dirty_start);
+DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
+
+TRACE_EVENT(balance_dirty_written,
+
+	TP_PROTO(struct backing_dev_info *bdi, int written),
+
+	TP_ARGS(bdi, written),
+
+	TP_STRUCT__entry(
+		__array(char,	name, 32)
+		__field(int,	written)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->written = written;
+	),
+
+	TP_printk("bdi %s written %d",
+		  __entry->name,
+		  __entry->written
+	)
+);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
@@ -137,12 +164,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 DEFINE_EVENT(wbc_class, name, \
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
 	TP_ARGS(wbc, bdi))
-DEFINE_WBC_EVENT(wbc_writeback_start);
-DEFINE_WBC_EVENT(wbc_writeback_written);
-DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,

^ permalink raw reply	[flat|nested] 42+ messages in thread
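
Assuming the conventional ftrace layout (the tracing directory under
/sys/kernel/debug, event directory derived from TRACE_SYSTEM writeback),
the renamed events can then be exercised like any other writeback
tracepoint. A user-space sketch; the paths are assumptions about the
running kernel's configuration:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* write a short string to a trace control file */
static int echo(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, val, strlen(val)) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}

int main(void)
{
        /* enable one of the events defined above */
        if (echo("/sys/kernel/debug/tracing/events/writeback/"
                 "writeback_start/enable", "1") != 0) {
                perror("enable writeback_start");
                return 1;
        }
        /* ... generate some writeback, then read the events back: */
        printf("now: cat /sys/kernel/debug/tracing/trace_pipe\n");
        return 0;
}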

* Re: [PATCH 14/17] writeback: make writeback_control.nr_to_write straight
  2011-05-11 14:38         ` Wu Fengguang
@ 2011-05-11 14:54           ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2011-05-11 14:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, Dave Chinner, Christoph Hellwig,
	linux-fsdevel

On Wed 11-05-11 22:38:07, Wu Fengguang wrote:
> >   One more thing that I've noticed while looking through this patch: You
> > are removing several trace points (without actually removing the trace
> > point definitions BTW):
> > trace_wbc_writeback_start(&wbc, wb->bdi);
> > trace_wbc_writeback_written(&wbc, wb->bdi);
> > trace_wbc_writeback_wait(&wbc, wb->bdi);
> > trace_wbc_balance_dirty_start(&wbc, bdi);
> > trace_wbc_balance_dirty_written(&wbc, bdi);
> > trace_wbc_balance_dirty_wait(&wbc, bdi);
> > 
> > I guess they should be replaced with equivalent trace points that take
> > struct wb_writeback_work as an argument instead of just removing them?
> 
> OK. How about this change?
  Great. This should be fine. Thanks.

								Honza

> 
> ---
>  fs/fs-writeback.c                |    3 ++
>  include/trace/events/writeback.h |   33 +++++++++++++++++++++++------
>  mm/page-writeback.c              |    3 ++
>  3 files changed, 33 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-05-11 22:19:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-11 22:20:50.000000000 +0800
> @@ -731,6 +731,7 @@ static long wb_writeback(struct bdi_writ
>  		}
>  
>  retry:
> +		trace_writeback_start(wb->bdi, work);
>  		spin_lock(&wb->list_lock);
>  		if (list_empty(&wb->b_io))
>  			queue_io(wb, work->older_than_this);
> @@ -739,6 +740,7 @@ retry:
>  		else
>  			progress = __writeback_inodes_wb(wb, work);
>  		spin_unlock(&wb->list_lock);
> +		trace_writeback_written(wb->bdi, work);
>  
>  		/*
>  		 * Did we write something? Try for more
> @@ -775,6 +777,7 @@ retry:
>  		 */
>  		spin_lock(&wb->list_lock);
>  		if (!list_empty(&wb->b_more_io))  {
> +			trace_writeback_wait(wb->bdi, work);
>  			inode = wb_inode(wb->b_more_io.prev);
>  			spin_lock(&inode->i_lock);
>  			inode_wait_for_writeback(inode, wb);
> --- linux-next.orig/mm/page-writeback.c	2011-05-11 22:19:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-05-11 22:20:50.000000000 +0800
> @@ -555,14 +555,17 @@ static void balance_dirty_pages(struct a
>  		 * threshold otherwise wait until the disk writes catch
>  		 * up.
>  		 */
> +		trace_balance_dirty_start(bdi);
>  		if (bdi_nr_reclaimable > bdi_thresh) {
>  			pages_written += writeback_inodes_wb(&bdi->wb,
>  							     write_chunk);
> +			trace_balance_dirty_written(bdi, pages_written);
>  			if (pages_written >= write_chunk)
>  				break;		/* We've done our duty */
>  		}
>  		__set_current_state(TASK_UNINTERRUPTIBLE);
>  		io_schedule_timeout(pause);
> +		trace_balance_dirty_wait(bdi);
>  
>  		/*
>  		 * Increase the delay for each loop, up to our previous
> --- linux-next.orig/include/trace/events/writeback.h	2011-05-11 22:19:22.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2011-05-11 22:22:20.000000000 +0800
> @@ -49,6 +49,9 @@ DEFINE_EVENT(writeback_work_class, name,
>  DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
>  DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
>  DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
> +DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
> +DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
> +DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
>  
>  TRACE_EVENT(writeback_pages_written,
>  	TP_PROTO(long pages_written),
> @@ -88,6 +91,30 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
>  DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
>  DEFINE_WRITEBACK_EVENT(writeback_thread_start);
>  DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
> +DEFINE_WRITEBACK_EVENT(balance_dirty_start);
> +DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
> +
> +TRACE_EVENT(balance_dirty_written,
> +
> +	TP_PROTO(struct backing_dev_info *bdi, int written),
> +
> +	TP_ARGS(bdi, written),
> +
> +	TP_STRUCT__entry(
> +		__array(char,	name, 32)
> +		__field(int,	written)
> +	),
> +
> +	TP_fast_assign(
> +		strncpy(__entry->name, dev_name(bdi->dev), 32);
> +		__entry->written = written;
> +	),
> +
> +	TP_printk("bdi %s written %d",
> +		  __entry->name,
> +		  __entry->written
> +	)
> +);
>  
>  DECLARE_EVENT_CLASS(wbc_class,
>  	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
> @@ -137,12 +164,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  DEFINE_EVENT(wbc_class, name, \
>  	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
>  	TP_ARGS(wbc, bdi))
> -DEFINE_WBC_EVENT(wbc_writeback_start);
> -DEFINE_WBC_EVENT(wbc_writeback_written);
> -DEFINE_WBC_EVENT(wbc_writeback_wait);
> -DEFINE_WBC_EVENT(wbc_balance_dirty_start);
> -DEFINE_WBC_EVENT(wbc_balance_dirty_written);
> -DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
>  DEFINE_WBC_EVENT(wbc_writepage);
>  
>  DECLARE_EVENT_CLASS(writeback_congest_waited_template,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

Thread overview: 42+ messages
2011-05-06  3:08 [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Wu Fengguang
2011-05-06  3:08 ` [PATCH 01/17] writeback: introduce wbc.tagged_sync for the WB_SYNC_NONE sync stage Wu Fengguang
2011-05-06 16:08   ` Jan Kara
2011-05-06  3:08 ` [PATCH 02/17] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
2011-05-06 16:33   ` Jan Kara
2011-05-10  2:14     ` Wu Fengguang
2011-05-10 12:05       ` Jan Kara
2011-05-06  3:08 ` [PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned Wu Fengguang
2011-05-06 14:36   ` Jan Kara
2011-05-10  2:23     ` Wu Fengguang
2011-05-10 13:52       ` Jan Kara
2011-05-10 15:00         ` Wu Fengguang
2011-05-06  3:08 ` [PATCH 04/17] writeback: try more writeback as long as something was written Wu Fengguang
2011-05-09 16:05   ` Jan Kara
2011-05-10  2:40     ` Wu Fengguang
2011-05-06  3:08 ` [PATCH 05/17] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
2011-05-06  3:08 ` [PATCH 06/17] writeback: sync expired inodes first in background writeback Wu Fengguang
2011-05-06 19:02   ` Rik van Riel
2011-05-09 16:08   ` Jan Kara
2011-05-09 16:18     ` Rik van Riel
2011-05-10  2:45       ` Wu Fengguang
2011-05-06  3:08 ` [PATCH 07/17] writeback: refill b_io iff empty Wu Fengguang
2011-05-06  3:08 ` [PATCH 08/17] writeback: split inode_wb_list_lock into bdi_writeback.list_lock Wu Fengguang
2011-05-06  3:08 ` [PATCH 09/17] writeback: elevate queue_io() into wb_writeback() Wu Fengguang
2011-05-09 16:15   ` Jan Kara
2011-05-06  3:08 ` [PATCH 10/17] writeback: avoid extra sync work at enqueue time Wu Fengguang
2011-05-09 16:16   ` Jan Kara
2011-05-06  3:08 ` [PATCH 11/17] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
2011-05-06  3:08 ` [PATCH 12/17] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
2011-05-06  3:08 ` [PATCH 13/17] writeback: remove writeback_control.more_io Wu Fengguang
2011-05-06  3:08 ` [PATCH 14/17] writeback: make writeback_control.nr_to_write straight Wu Fengguang
2011-05-09 16:54   ` Jan Kara
2011-05-10  3:19     ` Wu Fengguang
2011-05-10 13:44       ` Jan Kara
2011-05-11 14:38         ` Wu Fengguang
2011-05-11 14:54           ` Jan Kara
2011-05-06  3:08 ` [PATCH 15/17] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
2011-05-06  3:08 ` [PATCH 16/17] writeback: trace event writeback_single_inode Wu Fengguang
2011-05-06  4:16   ` [PATCH 16/17] writeback: trace event writeback_single_inode (v2) Wu Fengguang
2011-05-06  3:08 ` [PATCH 17/17] writeback: trace event writeback_queue_io Wu Fengguang
2011-05-06  4:06 ` [PATCH 00/17] writeback fixes and cleanups for 2.6.40 Anca Emanuel
2011-05-06  4:09   ` Wu Fengguang
