* [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
@ 2011-05-19 21:45 Wu Fengguang
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel,
	Wu Fengguang, LKML

Andrew,

This is the combination of all the recent writeback patches that have
been reasonably reviewed and tested.

The first 10 patches are already in -mm tree, with updates:

- remove "writeback: pass writeback_control down to move_expired_inodes()", and
  resolve the resulting merge conflicts in other patches.
- move ahead the sync livelock prevention patches (01, 02) so that (04) won't livelock sync
- merge the three -mm fixes to (08)
- fixed changelog of (01)
- rename .for_sync to .tagged_writepages

	[PATCH 01/18] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
	[PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock
	[PATCH 03/18] writeback: introduce writeback_control.inodes_cleaned
	[PATCH 04/18] writeback: try more writeback as long as something was written
	[PATCH 05/18] writeback: the kupdate expire timestamp should be a moving target
	[PATCH 06/18] writeback: sync expired inodes first in background writeback
	[PATCH 07/18] writeback: refill b_io iff empty
	[PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
	[PATCH 09/18] writeback: elevate queue_io() into wb_writeback()
	[PATCH 10/18] writeback: avoid extra sync work at enqueue time

The following 8 patches were posted and reviewed over the last few days:

	[PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc
	[PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs
	[PATCH 13/18] writeback: remove writeback_control.more_io
	[PATCH 14/18] writeback: make writeback_control.nr_to_write straight
	[PATCH 15/18] writeback: remove .nonblocking and .encountered_congestion
	[PATCH 16/18] writeback: trace event writeback_single_inode
	[PATCH 17/18] writeback: trace event writeback_queue_io
	[PATCH 18/18] writeback: rearrange the wb_writeback() loop

Thanks to Jan and Dave for the careful reviews!

The patches are git pullable from

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback
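
For example, a single pull of that branch into a local linux-next tree:

	git pull git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback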

 fs/block_dev.c                   |   16 +-
 fs/btrfs/extent_io.c             |    2 -
 fs/ext4/inode.c                  |    4 +-
 fs/fs-writeback.c                |  411 +++++++++++++++++++++-----------------
 fs/inode.c                       |    5 +-
 fs/nfs/write.c                   |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c      |    2 +-
 include/linux/backing-dev.h      |    2 +
 include/linux/writeback.h        |   14 +-
 include/trace/events/btrfs.h     |    6 +-
 include/trace/events/ext4.h      |    6 +-
 include/trace/events/writeback.h |  137 ++++++++++++--
 mm/backing-dev.c                 |   30 ++-
 mm/filemap.c                     |    6 +-
 mm/page-writeback.c              |   42 ++--
 mm/rmap.c                        |    4 +-
 16 files changed, 420 insertions(+), 270 deletions(-)

Thanks,
Fengguang



* [PATCH 01/18] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-for-sync.patch --]
[-- Type: text/plain, Size: 4595 bytes --]

sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.

Note that writeback_inodes_sb() is called not only by sync(); its other
callers are treated the same because they also need livelock prevention.

Impact: it changes the order in which pages/inodes are synced to disk.
In the WB_SYNC_NONE stage, writeback now won't proceed to the next
inode until it has finished with the current one.
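
For reference, the two stages come from sys_sync(), which at this point
reads roughly as follows (a simplified sketch; laptop_mode handling is
omitted):

	SYSCALL_DEFINE0(sync)
	{
		wakeup_flusher_threads(0);	/* kick WB_SYNC_NONE writeback */
		sync_filesystems(0);		/* WB_SYNC_NONE: now tagged    */
		sync_filesystems(1);		/* WB_SYNC_ALL: tag and wait   */
		return 0;
	}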

Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/ext4/inode.c           |    4 ++--
 fs/fs-writeback.c         |   17 +++++++++--------
 include/linux/writeback.h |    1 +
 mm/page-writeback.c       |    4 ++--
 4 files changed, 14 insertions(+), 12 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:01:40.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:02:18.000000000 +0800
@@ -36,6 +36,7 @@ struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	enum writeback_sync_modes sync_mode;
+	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
@@ -650,6 +651,7 @@ static long wb_writeback(struct bdi_writ
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
 		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
@@ -657,7 +659,7 @@ static long wb_writeback(struct bdi_writ
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
-	long write_chunk;
+	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 
 	if (wbc.for_kupdate) {
@@ -683,9 +685,7 @@ static long wb_writeback(struct bdi_writ
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
 	 */
-	if (wbc.sync_mode == WB_SYNC_NONE)
-		write_chunk = MAX_WRITEBACK_PAGES;
-	else
+	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
@@ -1191,10 +1191,11 @@ void writeback_inodes_sb_nr(struct super
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
-		.sb		= sb,
-		.sync_mode	= WB_SYNC_NONE,
-		.done		= &done,
-		.nr_pages	= nr,
+		.sb			= sb,
+		.sync_mode		= WB_SYNC_NONE,
+		.tagged_writepages	= 1,
+		.done			= &done,
+		.nr_pages		= nr,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
--- linux-next.orig/include/linux/writeback.h	2011-05-20 05:01:40.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-20 05:01:42.000000000 +0800
@@ -47,6 +47,7 @@ struct writeback_control {
 	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
+	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned more_io:1;		/* more io to be dispatched */
--- linux-next.orig/mm/page-writeback.c	2011-05-20 05:01:40.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-20 05:01:42.000000000 +0800
@@ -892,12 +892,12 @@ int write_cache_pages(struct address_spa
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
--- linux-next.orig/fs/ext4/inode.c	2011-05-20 05:01:40.000000000 +0800
+++ linux-next/fs/ext4/inode.c	2011-05-20 05:01:42.000000000 +0800
@@ -2741,7 +2741,7 @@ static int write_cache_pages_da(struct a
 	index = wbc->range_start >> PAGE_CACHE_SHIFT;
 	end = wbc->range_end >> PAGE_CACHE_SHIFT;
 
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
@@ -2975,7 +2975,7 @@ static int ext4_da_writepages(struct add
 	}
 
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 
 	while (!ret && wbc->nr_to_write > 0) {




* [PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-fix-sync-busyloop.patch --]
[-- Type: text/plain, Size: 3605 bytes --]

Explicitly update .dirtied_when on synced inodes, so that they are no
longer considered for writeback in the next round.

wb_writeback() is about to grow more aggressive "keep writing as long
as we wrote something" logic. After that, the "use LONG_MAX
.nr_to_write" trick from commit b9543dac5bbc ("writeback: avoid
livelocking WB_SYNC_ALL writeback") will no longer be enough to stop
sync livelocks.
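
The guard that this bumped timestamp trips is in writeback_sb_inodes()
(simplified excerpt):

	/*
	 * Was this inode dirtied after this writeback pass started?
	 * If so, stop here; this keeps sync from extra jobs and livelock.
	 */
	if (inode_dirtied_after(inode, wbc->wb_start))
		return 1;

With .dirtied_when set to the current jiffies, a just-synced inode
looks newer than wb_start and won't be picked up again in this pass.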

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    9 +++++++++
 1 file changed, 9 insertions(+)

ext3/ext4 are working fine now; however, tests show that XFS may still
livelock inside the XFS routines:

[ 3581.181253] sync            D ffff8800b6ca15d8  4560  4403   4392 0x00000000
[ 3581.181734]  ffff88006f775bc8 0000000000000046 ffff8800b6ca12b8 00000001b6ca1938
[ 3581.182411]  ffff88006f774000 00000000001d2e40 00000000001d2e40 ffff8800b6ca1280
[ 3581.183088]  00000000001d2e40 ffff88006f775fd8 00000340af111ef2 00000000001d2e40
[ 3581.183765] Call Trace:
[ 3581.184008]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.184392]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.184756]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.185120]  [<ffffffff812ed520>] xfs_ioend_wait+0x87/0x9f
[ 3581.185474]  [<ffffffff8108c97a>] ? wake_up_bit+0x2a/0x2a
[ 3581.185827]  [<ffffffff812f742a>] xfs_sync_inode_data+0x92/0x9d
[ 3581.186198]  [<ffffffff812f76e2>] xfs_inode_ag_walk+0x1a5/0x287
[ 3581.186569]  [<ffffffff812f779b>] ? xfs_inode_ag_walk+0x25e/0x287
[ 3581.186946]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.187311]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.187669]  [<ffffffff81092175>] ? local_clock+0x41/0x5a
[ 3581.188020]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.188403]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.188773]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.189130]  [<ffffffff812e236c>] ? xfs_perag_get+0x80/0xd0
[ 3581.189488]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.189858]  [<ffffffff812f7831>] ? xfs_inode_ag_iterator+0x6d/0x8f
[ 3581.190241]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.190606]  [<ffffffff812f780b>] xfs_inode_ag_iterator+0x47/0x8f
[ 3581.190982]  [<ffffffff811611f5>] ? __sync_filesystem+0x7a/0x7a
[ 3581.191352]  [<ffffffff812f7877>] xfs_sync_data+0x24/0x43
[ 3581.191703]  [<ffffffff812f7911>] xfs_quiesce_data+0x2c/0x88
[ 3581.192065]  [<ffffffff812f5556>] xfs_fs_sync_fs+0x21/0x48
[ 3581.192419]  [<ffffffff811611e1>] __sync_filesystem+0x66/0x7a
[ 3581.192783]  [<ffffffff8116120b>] sync_one_sb+0x16/0x18
[ 3581.193128]  [<ffffffff8113e3e3>] iterate_supers+0x72/0xce
[ 3581.193482]  [<ffffffff81161140>] sync_filesystems+0x20/0x22
[ 3581.193842]  [<ffffffff8116127e>] sys_sync+0x21/0x33
[ 3581.194177]  [<ffffffff819016c2>] system_call_fastpath+0x16/0x1b

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:02:18.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:11:08.000000000 +0800
@@ -432,6 +432,15 @@ writeback_single_inode(struct inode *ino
 				requeue_io(inode);
 			} else {
 				/*
+				 * Sync livelock prevention. Each inode is
+				 * tagged and synced in one shot, so we can
+				 * unconditionally update its dirty time to
+				 * prevent syncing it again.
+				 */
+				if (wbc->sync_mode == WB_SYNC_ALL ||
+				    wbc->tagged_writepages)
+					inode->dirtied_when = jiffies;
+				/*
 				 * Writeback blocked by something other than
 				 * congestion. Delay the inode for some time to
 				 * avoid spinning on the CPU (100% iowait)




* [PATCH 03/18] writeback: introduce writeback_control.inodes_cleaned
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-inodes_written.patch --]
[-- Type: text/plain, Size: 2319 bytes --]

The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happens to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written", but is also mis-interpreted as "no progress".

So introduce writeback_control.inodes_cleaned to count the inodes that
get cleaned.  A non-zero value means some progress was made on
writeback, in which case more writeback can be tried.
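
A minimal illustration of the two progress signals (a sketch; the inode
is assumed to be metadata-only dirty, e.g. after touch(1)):

	wbc.nr_to_write = MAX_WRITEBACK_PAGES;
	wbc.inodes_cleaned = 0;
	writeback_single_inode(inode, &wbc);	/* no dirty pages to write */
	/* wbc.nr_to_write unchanged   => old logic: "no progress", bail */
	/* wbc.inodes_cleaned now 1    => new logic: progress, try more  */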

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    4 ++++
 include/linux/writeback.h |    1 +
 2 files changed, 5 insertions(+)

about v1: the initial version counted successful ->write_inode()
calls.  However, that leads to busy loops for sync() over NFS, because
NFS ridiculously returns 0 (success) while at the same time redirtying
the inode.  The NFS case could be trivially fixed, but there may be
more hidden bugs in other filesystems.

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:11:08.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:11:24.000000000 +0800
@@ -464,6 +464,7 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
+			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -725,6 +726,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
+		wbc.inodes_cleaned = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -741,6 +743,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write <= 0)
 			continue;
+		if (wbc.inodes_cleaned)
+			continue;
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
--- linux-next.orig/include/linux/writeback.h	2011-05-20 05:08:27.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-20 05:11:24.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is




* [PATCH 04/18] writeback: try more writeback as long as something was written
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-background-retry.patch --]
[-- Type: text/plain, Size: 2708 bytes --]

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in
that they only populate a (possibly partial) subset of the eligible
inodes into b_io at entrance time. When the queued inodes are all
synced, they just return, possibly with all queued pages written but
wbc.nr_to_write still > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty by the time the current set of b_io inodes is
completed. So it is necessary to try another round of writeback as long
as we made some progress in this round. When there are no more eligible
inodes, no more inodes will be enqueued in queue_io(), hence nothing
could/will be synced and we may safely bail.
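
The resulting loop control flow, condensed from the patch below:

	for (;;) {
		/* ... one batch of writeback ... */
		if (wbc.nr_to_write < write_chunk)
			continue;	/* wrote some pages: try more    */
		if (wbc.inodes_cleaned)
			continue;	/* cleaned some inodes: try more */
		if (!wbc.more_io)
			break;		/* no more inodes for IO: bail   */
		/* else wait for some b_more_io inode to become
		 * available for writeback, to avoid busylooping */
	}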

For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

At queue_io() time, i90-i99 happen to be expired and are moved to b_io
for IO. When they finish successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will still be > 0. Then (without this
patch) wb_writeback() would quit the background work while still over
the background threshold. I guess this will be a fairly normal/frequent
case.

Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1).  I actually tried writing 1 page
per 1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

while running sync(1) at the same time. The sync completes quickly on
ext4, xfs and btrfs.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:11:24.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:11:28.000000000 +0800
@@ -739,23 +739,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * Dirty inodes are moved to b_io for writeback in batches.
+		 * The completion of the current batch does not necessarily
+		 * mean the overall work is done. So we keep looping as long
+		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)
 			break;
 		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.




* [PATCH 05/18] writeback: the kupdate expire timestamp should be a moving target
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Mel Gorman, Itaru Kitayama, Wu Fengguang,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-moving-dirty-expired.patch --]
[-- Type: text/plain, Size: 2063 bytes --]

Dynamically compute the dirty expire timestamp at queue_io() time.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some
old busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  pushing the expired dirty pages to the end of the LRU lists,
  triggering the evil pageout(). Nevertheless this patch merely
  addresses part of the problem.
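
Concretely (dirty_expire_interval defaults to 3000 centiseconds, i.e.
30 seconds):

	/* old: computed once at wb_writeback() entry, so the expire
	 *      horizon stays frozen at entry_time - 30s
	 * new: recomputed on every loop pass, so the horizon slides
	 *      along with jiffies - 30s, a moving target             */
	oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);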

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:11:28.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:11:29.000000000 +0800
@@ -672,11 +672,6 @@ static long wb_writeback(struct bdi_writ
 	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -723,6 +718,12 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		if (work->for_kupdate) {
+			oldest_jif = jiffies -
+				msecs_to_jiffies(dirty_expire_interval * 10);
+			wbc.older_than_this = &oldest_jif;
+		}
+
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;




* [PATCH 06/18] writeback: sync expired inodes first in background writeback
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Rik van Riel, Mel Gorman, Wu Fengguang,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-moving-dirty-expired-background.patch --]
[-- Type: text/plain, Size: 3354 bytes --]

A background flush work may run forever, so it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

At each queue_io() time, first try enqueuing only newly expired inodes.
If there are zero expired inodes to work with, relax the rule and
enqueue all dirty inodes, as sketched below.
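
Condensed from the patch below:

	if (work->for_kupdate || work->for_background)
		wbc.older_than_this = &oldest_jif;	/* expired inodes only */
retry:
	/* ... one round of writeback ... */
	if (work->for_background && wbc.older_than_this &&
	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
		wbc.older_than_this = NULL;	/* relax: take all dirty inodes */
		goto retry;
	}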

It at least makes sense from the data integrity point of view.

This may also reduce the number of dirty pages encountered by page
reclaim, e.g. in the pageout() calls. Normally older inodes contain
older dirty pages, which lie closer to the end of the LRU lists. So
syncing older inodes first helps reduce the number of dirty pages
reached by the page reclaim code.

More background: as Mel put it, "it makes sense to write old pages first
to reduce the chances page reclaim is initiating IO."

Rik also presented the situation with a graph:

LRU head                                 [*] dirty page
[                          *              *      * *  *  * * * * * *]

Ideally, most dirty pages should lie close to the LRU tail instead of
the LRU head. That requires the flusher thread to sync old/expired
inodes first (as there are obvious correlations between inode age and
page age), and to give fair opportunities to newly expired inodes
rather than sticking with some large eldest inodes (as larger inodes
have weaker inode<=>page age correlations).

This patch helps the flusher to meet both the above requirements.

Side effects: it might reduce the batch size and hence the
inode_wb_list_lock hold time, but in turn make the cluster-by-partition
logic in the same function less effective at reducing disk seeks.

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:11:29.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:11:30.000000000 +0800
@@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
-		if (work->for_kupdate) {
+		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
 			wbc.older_than_this = &oldest_jif;
@@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.pages_skipped = 0;
 		wbc.inodes_cleaned = 0;
 
+retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
 			__writeback_inodes_sb(work->sb, wb, &wbc);
@@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
+		 * background writeback will start with expired inodes, and
+		 * if none is found, fallback to all inodes. This order helps
+		 * reduce the number of dirty pages reaching the end of LRU
+		 * lists and cause trouble to the page reclaim.
+		 */
+		if (work->for_background &&
+		    wbc.older_than_this &&
+		    list_empty(&wb->b_io) &&
+		    list_empty(&wb->b_more_io)) {
+			wbc.older_than_this = NULL;
+			goto retry;
+		}
+		/*
 		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)




* [PATCH 07/18] writeback: refill b_io iff empty
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-refill-queue-iff-empty.patch --]
[-- Type: text/plain, Size: 7353 bytes --]

There is no point in carrying different refill policies for for_kupdate
and the other types of work. Use a consistent "refill b_io iff empty"
policy, which guarantees fairness in an easy-to-understand way.

A b_io refill sets up a _fixed_ work set with all currently eligible
inodes and starts a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done will the inodes that are
eligible at that time be enqueued and the walk started over.

This procedure provides fairness among the inodes because it guarantees
each inode is synced once and only once in each round, so all inodes
are free from starvation.
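
The policy itself is a one-line test (from the patch below):

	/*
	 * Refill b_io only when it is empty, so the work set stays
	 * fixed for the duration of one complete walk.
	 */
	if (list_empty(&wb->b_io))
		queue_io(wb, wbc->older_than_this);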

This change relies on wb_writeback() retrying as long as we made some
progress on cleaning pages and/or inodes. Without that ability, the old
logic for background work had to aggressively queue all eligible inodes
into b_io every time. But that is not a guarantee.

The test script below completes slightly faster now:

             2.6.39-rc3	  2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043      252.367
stddev           24.381       12.530

tar elapsed      30.097       28.808
dd  elapsed      13.214       11.782

	#!/bin/zsh

	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

	umount /dev/sda7
	mkfs.xfs -f /dev/sda7
	mount /dev/sda7 /fs

	echo 3 > /proc/sys/vm/drop_caches

	tic=$(cat /proc/uptime|cut -d' ' -f2)

	cd /fs
	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

	wait
	sync
	tac=$(cat /proc/uptime|cut -d' ' -f2)
	echo elapsed: $((tac - tic))

It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.

Analyzes from Dave Chinner in great details:

Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)

	writeback  8 small inodes 800 pages
		   1 large inode 224 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)
	.....

Your new code:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 8s)
	writeback  8 small inodes 800 pages

	b_io empty: (1800 pages written)
		b_more_io (large inode) -> b_io (1l)
		14 newly expired inodes -> b_io (1l, 14s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 14s)
	writeback  10 small inodes 1000 pages
		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
	writeback  5 small inodes 500 pages
	b_io empty: (2548 pages written)
		b_more_io (large inode) -> b_io (1l, 1s(24))
		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
	......

Rough progression of pages written at b_io refill:

Old code:

	total	large file	% of writeback
	1024	224		21.9% (fixed)

New code:
	total	large file	% of writeback
	1800	1024		~55%
	2550	1024		~40%
	3050	1024		~33%
	3500	1024		~29%
	3950	1024		~26%
	4250	1024		~24%
	4500	1024		~22.7%
	4700	1024		~21.7%
	4800	1024		~21.3%
	4800	1024		~21.3%
	(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

after + dyn-expire + ioless:
                &(&wb->list_lock)->rlock:          2291           2304           0.15         204.09        3125.12          35315         970712           0.10         223.84     1113437.05
                ------------------------
                &(&wb->list_lock)->rlock              9          [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock           1614          [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock            459          [<ffffffff8115d351>] writeback_sb_inodes+0x108/0x154
                &(&wb->list_lock)->rlock            137          [<ffffffff8115cdcf>] writeback_single_inode+0x1b4/0x296
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110c367>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock           1160          [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock            435          [<ffffffff8115dcb6>] writeback_inodes_wb+0x33/0x12b

after + dyn-expire:
                &(&wb->list_lock)->rlock:        226820         229719           0.10         194.28      809275.91         327372     1033513685           0.08         476.96  3590929811.61
                ------------------------
                &(&wb->list_lock)->rlock             11          [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock          30559          [<ffffffff8115bb1f>] wb_writeback+0x2fb/0x3c3
                &(&wb->list_lock)->rlock          37339          [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
                &(&wb->list_lock)->rlock          54880          [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b606>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock          55347          [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
                &(&wb->list_lock)->rlock          55338          [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:11:30.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:11:31.000000000 +0800
@@ -589,7 +589,8 @@ void writeback_inodes_wb(struct bdi_writ
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
@@ -616,7 +617,7 @@ static void __writeback_inodes_sb(struct
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_wb_list_lock);




* [PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Wu Fengguang,
	Dave Chinner, linux-fsdevel, LKML

[-- Attachment #1: writeback-split-lock.patch --]
[-- Type: text/plain, Size: 21361 bytes --]

From: Christoph Hellwig <hch@infradead.org>

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for
metadata-heavy workloads.  It won't help for single-filesystem
workloads, for which we'll need the I/O-less balance_dirty_pages(), but
at least we can now dedicate a cpu to spinning on each bdi for larger
systems.

Based on earlier patches from Nick Piggin and Dave Chinner.
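
With per-bdi locks, moving an inode between two bdis (see
bdev_inode_switch_bdi() in the patch) must take both list_locks.
bdi_lock_two() orders them by address to avoid ABBA deadlocks, and uses
spin_lock_nested() so lockdep accepts nesting two locks of the same
class:

	if (wb1 < wb2) {	/* a stable global order: lock address */
		spin_lock(&wb1->list_lock);
		spin_lock_nested(&wb2->list_lock, 1);
	} else {
		spin_lock(&wb2->list_lock);
		spin_lock_nested(&wb1->list_lock, 1);
	}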

It reduces lock contention to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/block_dev.c              |   16 +++--
 fs/fs-writeback.c           |   97 +++++++++++++++++-----------------
 fs/inode.c                  |    5 -
 include/linux/backing-dev.h |    2 
 include/linux/writeback.h   |    2 
 mm/backing-dev.c            |   21 +++++--
 mm/filemap.c                |    6 +-
 mm/rmap.c                   |    4 -
 8 files changed, 85 insertions(+), 68 deletions(-)

--- linux-next.orig/fs/block_dev.c	2011-05-20 05:08:26.000000000 +0800
+++ linux-next/fs/block_dev.c	2011-05-20 05:11:32.000000000 +0800
@@ -44,24 +44,28 @@ inline struct block_device *I_BDEV(struc
 {
 	return &BDEV_I(inode)->bdev;
 }
-
 EXPORT_SYMBOL(I_BDEV);
 
 /*
- * move the inode from it's current bdi to the a new bdi. if the inode is dirty
- * we need to move it onto the dirty list of @dst so that the inode is always
- * on the right list.
+ * Move the inode from its current bdi to a new bdi. If the inode is dirty we
+ * need to move it onto the dirty list of @dst so that the inode is always on
+ * the right list.
  */
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
+	if (unlikely(dst == old))		/* deadlock avoidance */
+		return;
+	bdi_lock_two(&old->wb, &dst->wb);
 	spin_lock(&inode->i_lock);
 	inode->i_data.backing_dev_info = dst;
 	if (inode->i_state & I_DIRTY)
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&old->wb.list_lock);
+	spin_unlock(&dst->wb.list_lock);
 }
 
 static sector_t max_block(struct block_device *bdev)
--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:11:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:11:32.000000000 +0800
@@ -181,12 +181,13 @@ void bdi_start_background_writeback(stru
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
 	list_del_init(&inode->i_wb_list);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 }
 
-
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -196,11 +197,9 @@ void inode_wb_list_del(struct inode *ino
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -214,11 +213,9 @@ static void redirty_tail(struct inode *i
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
-static void requeue_io(struct inode *inode)
+static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
@@ -226,7 +223,7 @@ static void inode_sync_complete(struct i
 {
 	/*
 	 * Prevent speculative execution through
-	 * spin_unlock(&inode_wb_list_lock);
+	 * spin_unlock(&wb->list_lock);
 	 */
 
 	smp_mb();
@@ -302,7 +299,7 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -317,7 +314,8 @@ static int write_inode(struct inode *ino
 /*
  * Wait for writeback on an inode to complete.
  */
-static void inode_wait_for_writeback(struct inode *inode)
+static void inode_wait_for_writeback(struct inode *inode,
+				     struct bdi_writeback *wb)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -325,15 +323,15 @@ static void inode_wait_for_writeback(str
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_wb_list_lock and
+ * Write out an inode's dirty pages.  Called under wb->list_lock and
  * inode->i_lock.  Either the caller has an active reference on the inode or
  * the inode has I_WILL_FREE set.
  *
@@ -344,13 +342,14 @@ static void inode_wait_for_writeback(str
  * livelocks, etc.
  */
 static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+		       struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
 
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
 
 	if (!atomic_read(&inode->i_count))
@@ -368,14 +367,14 @@ writeback_single_inode(struct inode *ino
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			return 0;
 		}
 
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
-		inode_wait_for_writeback(inode);
+		inode_wait_for_writeback(inode, wb);
 	}
 
 	BUG_ON(inode->i_state & I_SYNC);
@@ -384,7 +383,7 @@ writeback_single_inode(struct inode *ino
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -415,7 +414,7 @@ writeback_single_inode(struct inode *ino
 			ret = err;
 	}
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -429,7 +428,7 @@ writeback_single_inode(struct inode *ino
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(inode);
+				requeue_io(inode, wb);
 			} else {
 				/*
 				 * Sync livelock prevention. Each inode is
@@ -447,7 +446,7 @@ writeback_single_inode(struct inode *ino
 				 * retrying writeback of the dirty page/inode
 				 * that cannot be performed immediately.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 			}
 		} else if (inode->i_state & I_DIRTY) {
 			/*
@@ -456,7 +455,7 @@ writeback_single_inode(struct inode *ino
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -521,7 +520,7 @@ static int writeback_sb_inodes(struct su
 				 * superblock, move all inodes not belonging
 				 * to it back onto the dirty list.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 				continue;
 			}
 
@@ -541,7 +540,7 @@ static int writeback_sb_inodes(struct su
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 
@@ -557,19 +556,19 @@ static int writeback_sb_inodes(struct su
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wbc);
+		writeback_single_inode(inode, wb, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -588,7 +587,7 @@ void writeback_inodes_wb(struct bdi_writ
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
@@ -598,7 +597,7 @@ void writeback_inodes_wb(struct bdi_writ
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 		ret = writeback_sb_inodes(sb, wb, wbc, false);
@@ -607,7 +606,7 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -616,11 +615,11 @@ static void __writeback_inodes_sb(struct
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 }
 
 /*
@@ -776,15 +775,15 @@ retry:
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode);
+			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 	}
 
 	return wrote;
@@ -1121,10 +1120,10 @@ void __mark_inode_dirty(struct inode *in
 			}
 
 			spin_unlock(&inode->i_lock);
-			spin_lock(&inode_wb_list_lock);
+			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
-			spin_unlock(&inode_wb_list_lock);
+			spin_unlock(&bdi->wb.list_lock);
 
 			if (wakeup_bdi)
 				bdi_wakeup_thread_delayed(bdi);
@@ -1326,6 +1325,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
@@ -1338,11 +1338,11 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, &wbc);
+	ret = writeback_single_inode(inode, wb, &wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -1362,13 +1362,14 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wbc);
+	ret = writeback_single_inode(inode, wb, wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
--- linux-next.orig/fs/inode.c	2011-05-20 05:08:26.000000000 +0800
+++ linux-next/fs/inode.c	2011-05-20 05:11:32.000000000 +0800
@@ -37,7 +37,7 @@
  *   inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
- * inode_wb_list_lock protects:
+ * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
@@ -48,7 +48,7 @@
  *   inode->i_lock
  *     inode_lru_lock
  *
- * inode_wb_list_lock
+ * bdi->wb.list_lock
  *   inode->i_lock
  *
  * inode_hash_lock
@@ -111,7 +111,6 @@ static LIST_HEAD(inode_lru);
 static DEFINE_SPINLOCK(inode_lru_lock);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);
 
 /*
  * iprune_sem provides exclusion between the icache shrinking and the
--- linux-next.orig/include/linux/backing-dev.h	2011-05-20 05:08:26.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-05-20 05:11:32.000000000 +0800
@@ -57,6 +57,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
 struct backing_dev_info {
@@ -106,6 +107,7 @@ int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux-next.orig/include/linux/writeback.h	2011-05-20 05:11:24.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-20 05:11:32.000000000 +0800
@@ -9,8 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_wb_list_lock;
-
 /*
  * fs/fs-writeback.c
  */
--- linux-next.orig/mm/backing-dev.c	2011-05-20 05:08:26.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-05-20 05:11:32.000000000 +0800
@@ -45,6 +45,17 @@ static struct timer_list sync_supers_tim
 static int bdi_sync_supers(void *);
 static void sync_supers_timer_fn(unsigned long);
 
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
+{
+	if (wb1 < wb2) {
+		spin_lock(&wb1->list_lock);
+		spin_lock_nested(&wb2->list_lock, 1);
+	} else {
+		spin_lock(&wb2->list_lock);
+		spin_lock_nested(&wb1->list_lock, 1);
+	}
+}
+
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
@@ -67,14 +78,14 @@ static int bdi_debug_stats_show(struct s
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -628,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -676,11 +688,12 @@ void bdi_destroy(struct backing_dev_info
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&inode_wb_list_lock);
+		bdi_lock_two(&bdi->wb, dst);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
+		spin_unlock(&dst->list_lock);
 	}
 
 	bdi_unregister(bdi);
--- linux-next.orig/mm/filemap.c	2011-05-20 05:08:26.000000000 +0800
+++ linux-next/mm/filemap.c	2011-05-20 05:11:32.000000000 +0800
@@ -81,7 +81,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  inode_wb_list_lock
+ *  bdi->wb.list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -99,9 +99,9 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(page_remove_rmap->set_page_dirty)
+ *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(zap_pte_range->set_page_dirty)
+ *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
--- linux-next.orig/mm/rmap.c	2011-05-20 05:08:26.000000000 +0800
+++ linux-next/mm/rmap.c	2011-05-20 05:11:32.000000000 +0800
@@ -32,11 +32,11 @@
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
  *               inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *               inode_wb_list_lock (in set_page_dirty's __mark_inode_dirty)
+ *               bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                 sb_lock (within inode_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_wb_list_lock in __sync_single_inode)
+ *                           within bdi.wb->list_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 09/18] writeback: elevate queue_io() into wb_writeback()
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (7 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 10/18] writeback: avoid extra sync work at enqueue time Wu Fengguang
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: move-queue_io.patch --]
[-- Type: text/plain, Size: 3402 bytes --]

Code refactor for a more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() now decide when to queue_io()
  before calling __writeback_inodes_wb() (see the sketch below)
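
A minimal sketch of the resulting call pattern, condensed verbatim from
the hunks below (no new code here, just the pieces put side by side):

	void writeback_inodes_wb(struct bdi_writeback *wb,
			struct writeback_control *wbc)
	{
		spin_lock(&wb->list_lock);
		if (list_empty(&wb->b_io))
			queue_io(wb, wbc->older_than_this);
		__writeback_inodes_wb(wb, wbc);	/* only walks b_io */
		spin_unlock(&wb->list_lock);
	}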

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:11:32.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:12:41.000000000 +0800
@@ -580,17 +580,13 @@ static int writeback_sb_inodes(struct su
 	return 1;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+static void __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct writeback_control *wbc)
 {
 	int ret = 0;
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&wb->list_lock);
-
-	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -606,19 +602,16 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
-static void __writeback_inodes_sb(struct super_block *sb,
-		struct bdi_writeback *wb, struct writeback_control *wbc)
+void writeback_inodes_wb(struct bdi_writeback *wb,
+		struct writeback_control *wbc)
 {
-	WARN_ON(!rwsem_is_locked(&sb->s_umount));
-
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
-	writeback_sb_inodes(sb, wb, wbc, true);
+	__writeback_inodes_wb(wb, wbc);
 	spin_unlock(&wb->list_lock);
 }
 
@@ -685,7 +678,7 @@ static long wb_writeback(struct bdi_writ
 	 * The intended call sequence for WB_SYNC_ALL writeback is:
 	 *
 	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
+	 *          writeback_sb_inodes()       <== called only once
 	 *              write_cache_pages()     <== called once for each inode
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
@@ -694,6 +687,7 @@ static long wb_writeback(struct bdi_writ
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
+	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -731,10 +725,12 @@ static long wb_writeback(struct bdi_writ
 
 retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
+		if (list_empty(&wb->b_io))
+			queue_io(wb, wbc.older_than_this);
 		if (work->sb)
-			__writeback_inodes_sb(work->sb, wb, &wbc);
+			writeback_sb_inodes(work->sb, wb, &wbc, true);
 		else
-			writeback_inodes_wb(wb, &wbc);
+			__writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
@@ -775,7 +771,6 @@ retry:
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
@@ -783,8 +778,8 @@ retry:
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&wb->list_lock);
 	}
+	spin_unlock(&wb->list_lock);
 
 	return wrote;
 }



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 10/18] writeback: avoid extra sync work at enqueue time
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (8 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 09/18] writeback: elevate queue_io() into wb_writeback() Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-wb_start.patch --]
[-- Type: text/plain, Size: 2179 bytes --]

This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
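
Roughly, the resulting scheme looks like the sketch below (condensed
from the diff; it assumes move_expired_inodes() keeps its existing
inode_dirtied_after() check, which now becomes the only livelock guard):

	unsigned long oldest_jif = jiffies;	/* sync start time */

	wbc.older_than_this = &oldest_jif;
	/*
	 * move_expired_inodes() skips inodes dirtied after oldest_jif,
	 * so inodes dirtied after the sync started never enter b_io
	 * and cannot livelock the sync.
	 */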

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   16 +++-------------
 include/linux/writeback.h |    3 ---
 2 files changed, 3 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:12:41.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:12:42.000000000 +0800
@@ -544,15 +544,6 @@ static int writeback_sb_inodes(struct su
 			continue;
 		}
 
-		/*
-		 * Was this inode dirtied after sync_sb_inodes was called?
-		 * This keeps sync from extra jobs and livelock.
-		 */
-		if (inode_dirtied_after(inode, wbc->wb_start)) {
-			spin_unlock(&inode->i_lock);
-			return 1;
-		}
-
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
@@ -585,9 +576,6 @@ static void __writeback_inodes_wb(struct
 {
 	int ret = 0;
 
-	if (!wbc->wb_start)
-		wbc->wb_start = jiffies; /* livelock avoidance */
-
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
 		struct super_block *sb = inode->i_sb;
@@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
 	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
 		write_chunk = LONG_MAX;
 
-	wbc.wb_start = jiffies; /* livelock avoidance */
+	oldest_jif = jiffies;
+	wbc.older_than_this = &oldest_jif;
+
 	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
--- linux-next.orig/include/linux/writeback.h	2011-05-20 05:11:32.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-20 05:12:42.000000000 +0800
@@ -26,9 +26,6 @@ struct writeback_control {
 	enum writeback_sync_modes sync_mode;
 	unsigned long *older_than_this;	/* If !NULL, only write back inodes
 					   older than this */
-	unsigned long wb_start;         /* Time writeback_inodes_wb was
-					   called. This is needed to avoid
-					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (9 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 10/18] writeback: avoid extra sync work at enqueue time Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-task_dirty_limit-comment.patch --]
[-- Type: text/plain, Size: 1193 bytes --]

Clarify the bdi_dirty_limit() comment.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-20 05:08:25.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-20 05:12:43.000000000 +0800
@@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-/*
+/**
  * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ *
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * And the "limit" in the name is not seriously taken as hard limit in
+ * balance_dirty_pages().
  *
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
  * - piling up dirty pages (that will take long time to sync) on slow devices
  *



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (10 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 13/18] writeback: remove writeback_control.more_io Wu Fengguang
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Wu Fengguang, Dave Chinner, Christoph Hellwig, linux-fsdevel,
	LKML

[-- Attachment #1: writeback-trace-global-dirty-states-fix.patch --]
[-- Type: text/plain, Size: 2707 bytes --]

This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

Notes about the tmpfs/ramfs behavior changes:

In 2.6.36 and older kernels, tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold.  This is because both the bdi dirty page
count and the bdi threshold will be 0 for tmpfs/ramfs, hence this test
will always evaluate to TRUE:

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                        || (nr_reclaimable + nr_writeback >= dirty_thresh);

For 2.6.37, someone complained that the then-current logic did not allow
users to set vm.dirty_ratio=0.  So commit 4cbec4c8b9 changed the test to

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                        || (nr_reclaimable + nr_writeback > dirty_thresh);

So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen, but once it does, it will block many tasks).

I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
that on a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly help any workload because of its
"either no throttling, or throttled to death" property.

So based on 2.6.37, this patch won't bring more noticeable changes.
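
For reference, the check added below keys off the BDI capability bits;
bdi_cap_account_dirty() is roughly the following helper from
include/linux/backing-dev.h (quoted from memory, not part of this
patch):

	static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
	{
		/* tmpfs/ramfs set BDI_CAP_NO_ACCT_DIRTY on their BDIs */
		return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
	}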

CC: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-20 05:12:43.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-20 05:12:44.000000000 +0800
@@ -244,13 +244,8 @@ void task_dirty_inc(struct task_struct *
 static void bdi_writeout_fraction(struct backing_dev_info *bdi,
 		long *numerator, long *denominator)
 {
-	if (bdi_cap_writeback_dirty(bdi)) {
-		prop_fraction_percpu(&vm_completions, &bdi->completions,
+	prop_fraction_percpu(&vm_completions, &bdi->completions,
 				numerator, denominator);
-	} else {
-		*numerator = 0;
-		*denominator = 1;
-	}
 }
 
 static inline void task_dirties_fraction(struct task_struct *tsk,
@@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (!bdi_cap_account_dirty(bdi))
+		return;
+
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 13/18] writeback: remove writeback_control.more_io
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (11 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 14/18] writeback: make writeback_control.nr_to_write straight Wu Fengguang
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-more_io.patch --]
[-- Type: text/plain, Size: 4431 bytes --]

When wbc.more_io was first introduced, it indicated whether there was at
least one superblock whose s_more_io contained more IO work. Now with
per-bdi writeback, it can be replaced with a simple b_more_io test.
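
The replacement boils down to the one list test in wb_writeback(), taken
from the hunk below:

	/*
	 * No more inodes for IO, bail
	 */
	if (list_empty(&wb->b_more_io))
		break;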

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    9 ++-------
 include/linux/writeback.h        |    1 -
 include/trace/events/ext4.h      |    6 ++----
 include/trace/events/writeback.h |    5 +----
 4 files changed, 5 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:12:42.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:12:45.000000000 +0800
@@ -560,12 +560,8 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			return 1;
-		}
-		if (!list_empty(&wb->b_more_io))
-			wbc->more_io = 1;
 	}
 	/* b_io is empty */
 	return 1;
@@ -708,7 +704,6 @@ static long wb_writeback(struct bdi_writ
 			wbc.older_than_this = &oldest_jif;
 		}
 
-		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 		wbc.inodes_cleaned = 0;
@@ -754,7 +749,7 @@ retry:
 		/*
 		 * No more inodes for IO, bail
 		 */
-		if (!wbc.more_io)
+		if (list_empty(&wb->b_more_io))
 			break;
 		/*
 		 * Nothing written. Wait for some inode to
--- linux-next.orig/include/linux/writeback.h	2011-05-20 05:12:42.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-20 05:12:45.000000000 +0800
@@ -46,7 +46,6 @@ struct writeback_control {
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
-	unsigned more_io:1;		/* more io to be dispatched */
 };
 
 /*
--- linux-next.orig/include/trace/events/writeback.h	2011-05-20 05:08:25.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-20 05:12:45.000000000 +0800
@@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(int, more_io)
 		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
@@ -116,7 +115,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->more_io	= wbc->more_io;
 		__entry->older_than_this = wbc->older_than_this ?
 						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
@@ -124,7 +122,7 @@ DECLARE_EVENT_CLASS(wbc_class,
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -134,7 +132,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->more_io,
 		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
--- linux-next.orig/include/trace/events/ext4.h	2011-05-20 05:08:25.000000000 +0800
+++ linux-next/include/trace/events/ext4.h	2011-05-20 05:12:45.000000000 +0800
@@ -404,7 +404,6 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__field(	int,	pages_written		)
 		__field(	long,	pages_skipped		)
 		__field(	int,	sync_mode		)
-		__field(	char,	more_io			)	
 		__field(       pgoff_t,	writeback_index		)
 	),
 
@@ -415,16 +414,15 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__entry->pages_written	= pages_written;
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->sync_mode	= wbc->sync_mode;
-		__entry->more_io	= wbc->more_io;
 		__entry->writeback_index = inode->i_mapping->writeback_index;
 	),
 
 	TP_printk("dev %d,%d ino %lu ret %d pages_written %d pages_skipped %ld "
-		  " more_io %d sync_mode %d writeback_index %lu",
+		  "sync_mode %d writeback_index %lu",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino, __entry->ret,
 		  __entry->pages_written, __entry->pages_skipped,
-		  __entry->more_io, __entry->sync_mode,
+		  __entry->sync_mode,
 		  (unsigned long) __entry->writeback_index)
 );
 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 14/18] writeback: make writeback_control.nr_to_write straight
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (12 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 13/18] writeback: remove writeback_control.more_io Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 22:06   ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 15/18] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-pass-work-to-writeback_sb_inodes.patch --]
[-- Type: text/plain, Size: 18372 bytes --]

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

This immediately cleans things up: e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes() it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
quota remains in wbc->nr_to_write.
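
Condensed from the writeback_sb_inodes() hunk below, the per-inode flow
now reads (sketch; unrelated fields and lines elided):

	struct writeback_control wbc = {
		.sync_mode	= work->sync_mode,
		/* tagged_writepages, for_kupdate, ... copied likewise */
	};
	long write_chunk;

	write_chunk = writeback_chunk_size(work);  /* LONG_MAX or 4MB worth */
	wbc.nr_to_write = write_chunk;
	wbc.pages_skipped = 0;

	writeback_single_inode(inode, wb, &wbc);

	work->nr_pages -= write_chunk - wbc.nr_to_write;
	wrote += write_chunk - wbc.nr_to_write;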

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 
 fs/fs-writeback.c                |  200 ++++++++++++++++-------------
 include/linux/writeback.h        |    6 
 include/trace/events/writeback.h |   39 ++++-
 mm/backing-dev.c                 |    9 -
 mm/page-writeback.c              |   17 --
 6 files changed, 147 insertions(+), 126 deletions(-)

change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
  to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_cleaned
- remove wbc_writeback_* trace events for now

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:12:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:12:46.000000000 +0800
@@ -30,11 +30,21 @@
 #include "internal.h"
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024L
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -463,7 +473,6 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
-			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -496,6 +505,31 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+	long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_sb_inodes()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
+		pages = LONG_MAX;
+	else
+		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+	return pages;
+}
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -503,18 +537,30 @@ static bool pin_sb_for_writeback(struct 
  * inodes. Otherwise write only ones which go sequentially
  * in reverse order.
  *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes cleaned.
  */
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
-		struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
 {
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	unsigned long start_time = jiffies;
+	long write_chunk;
+	long wrote = 0;  /* count both pages and inodes */
+
 	while (!list_empty(&wb->b_io)) {
-		long pages_skipped;
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
 		if (inode->i_sb != sb) {
-			if (only_this_sb) {
+			if (work->sb) {
 				/*
 				 * We only want to write back data for this
 				 * superblock, move all inodes not belonging
@@ -529,7 +575,7 @@ static int writeback_sb_inodes(struct su
 			 * Bounce back to the caller to unpin this and
 			 * pin the next superblock.
 			 */
-			return 0;
+			break;
 		}
 
 		/*
@@ -543,34 +589,47 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode, wb);
 			continue;
 		}
-
 		__iget(inode);
+		write_chunk = writeback_chunk_size(work);
+		wbc.nr_to_write = write_chunk;
+		wbc.pages_skipped = 0;
+
+		writeback_single_inode(inode, wb, &wbc);
 
-		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wb, wbc);
-		if (wbc->pages_skipped != pages_skipped) {
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
+		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
 			redirty_tail(inode, wb);
-		}
+		} else if (!(inode->i_state & I_DIRTY))
+			wrote++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0)
-			return 1;
+		/*
+		 * bail out to wb_writeback() often enough to check
+		 * background threshold and other termination conditions.
+		 */
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
-	/* b_io is empty */
-	return 1;
+	return wrote;
 }
 
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
-				  struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
-	int ret = 0;
+	unsigned long start_time = jiffies;
+	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -580,33 +639,36 @@ static void __writeback_inodes_wb(struct
 			requeue_io(inode, wb);
 			continue;
 		}
-		ret = writeback_sb_inodes(sb, wb, wbc, false);
+		wrote += writeback_sb_inodes(sb, wb, work);
 		drop_super(sb);
 
-		if (ret)
-			break;
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
 	/* Leave any unwritten inodes on b_io */
+	return wrote;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
 {
+	struct wb_writeback_work work = {
+		.nr_pages	= nr_pages,
+		.sync_mode	= WB_SYNC_NONE,
+		.range_cyclic	= 1,
+	};
+
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
-	__writeback_inodes_wb(wb, wbc);
+		queue_io(wb, NULL);
+	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
-}
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+	return nr_pages - work.nr_pages;
+}
 
 static inline bool over_bground_thresh(void)
 {
@@ -636,42 +698,13 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_writepages	= work->tagged_writepages,
-		.older_than_this	= NULL,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.range_cyclic		= work->range_cyclic,
-	};
+	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	long wrote = 0;
-	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
-
-	if (!wbc.range_cyclic) {
-		wbc.range_start = 0;
-		wbc.range_end = LLONG_MAX;
-	}
-
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          writeback_sb_inodes()       <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
-		write_chunk = LONG_MAX;
+	long progress;
 
 	oldest_jif = jiffies;
-	wbc.older_than_this = &oldest_jif;
+	work->older_than_this = &oldest_jif;
 
 	spin_lock(&wb->list_lock);
 	for (;;) {
@@ -701,25 +734,18 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			wbc.older_than_this = &oldest_jif;
+			work->older_than_this = &oldest_jif;
 		}
 
-		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
-		wbc.inodes_cleaned = 0;
-
 retry:
-		trace_wbc_writeback_start(&wbc, wb->bdi);
+		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, wbc.older_than_this);
+			queue_io(wb, work->older_than_this);
 		if (work->sb)
-			writeback_sb_inodes(work->sb, wb, &wbc, true);
+			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-			__writeback_inodes_wb(wb, &wbc);
-		trace_wbc_writeback_written(&wbc, wb->bdi);
-
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
+			progress = __writeback_inodes_wb(wb, work);
+		trace_writeback_written(wb->bdi, work);
 
 		/*
 		 * Did we write something? Try for more
@@ -729,9 +755,7 @@ retry:
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		if (wbc.inodes_cleaned)
+		if (progress)
 			continue;
 		/*
 		 * background writeback will start with expired inodes, and
@@ -740,10 +764,10 @@ retry:
 		 * lists and cause trouble to the page reclaim.
 		 */
 		if (work->for_background &&
-		    wbc.older_than_this &&
+		    work->older_than_this &&
 		    list_empty(&wb->b_io) &&
 		    list_empty(&wb->b_more_io)) {
-			wbc.older_than_this = NULL;
+			work->older_than_this = NULL;
 			goto retry;
 		}
 		/*
@@ -757,8 +781,8 @@ retry:
 		 * we'll just busyloop.
 		 */
 		if (!list_empty(&wb->b_more_io))  {
+			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
-			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
@@ -766,7 +790,7 @@ retry:
 	}
 	spin_unlock(&wb->list_lock);
 
-	return wrote;
+	return nr_pages - work->nr_pages;
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-05-20 05:12:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-20 05:12:46.000000000 +0800
@@ -24,12 +24,9 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
-	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
@@ -58,8 +55,7 @@ void writeback_inodes_sb_nr(struct super
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
 
--- linux-next.orig/mm/backing-dev.c	2011-05-20 05:11:32.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-05-20 05:12:46.000000000 +0800
@@ -262,14 +262,7 @@ int bdi_has_dirty_io(struct backing_dev_
 
 static void bdi_flush_io(struct backing_dev_info *bdi)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
-		.range_cyclic		= 1,
-		.nr_to_write		= 1024,
-	};
-
-	writeback_inodes_wb(&bdi->wb, &wbc);
+	writeback_inodes_wb(&bdi->wb, 1024);
 }
 
 /*
--- linux-next.orig/mm/page-writeback.c	2011-05-20 05:12:44.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-20 05:12:46.000000000 +0800
@@ -494,13 +494,6 @@ static void balance_dirty_pages(struct a
 		return;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -562,17 +555,17 @@ static void balance_dirty_pages(struct a
 		 * threshold otherwise wait until the disk writes catch
 		 * up.
 		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
+		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
+			pages_written += writeback_inodes_wb(&bdi->wb,
+							     write_chunk);
+			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		trace_balance_dirty_wait(bdi);
 
 		/*
 		 * Increase the delay for each loop, up to our previous
--- linux-next.orig/include/trace/events/writeback.h	2011-05-20 05:12:45.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-20 05:12:46.000000000 +0800
@@ -49,6 +49,9 @@ DEFINE_EVENT(writeback_work_class, name,
 DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
@@ -88,6 +91,30 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
+DEFINE_WRITEBACK_EVENT(balance_dirty_start);
+DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
+
+TRACE_EVENT(balance_dirty_written,
+
+	TP_PROTO(struct backing_dev_info *bdi, int written),
+
+	TP_ARGS(bdi, written),
+
+	TP_STRUCT__entry(
+		__array(char,	name, 32)
+		__field(int,	written)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->written = written;
+	),
+
+	TP_printk("bdi %s written %d",
+		  __entry->name,
+		  __entry->written
+	)
+);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
@@ -101,7 +128,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -115,14 +141,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -132,7 +156,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
@@ -141,12 +164,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 DEFINE_EVENT(wbc_class, name, \
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
 	TP_ARGS(wbc, bdi))
-DEFINE_WBC_EVENT(wbc_writeback_start);
-DEFINE_WBC_EVENT(wbc_writeback_written);
-DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
--- linux-next.orig/fs/btrfs/extent_io.c	2011-05-20 05:08:25.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-05-20 05:12:46.000000000 +0800
@@ -2596,7 +2596,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2629,7 +2628,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 15/18] writeback: remove .nonblocking and .encountered_congestion
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (13 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 14/18] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 16/18] writeback: trace event writeback_single_inode Wu Fengguang
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-nonblocking.patch --]
[-- Type: text/plain, Size: 3584 bytes --]

Remove two unused struct writeback_control fields:

	.encountered_congestion	(completely unused)
	.nonblocking		(never set, checked/showed in XFS,NFS/btrfs)

The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.

Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c               |    3 +--
 fs/xfs/linux-2.6/xfs_aops.c  |    2 +-
 include/linux/writeback.h    |    2 --
 include/trace/events/btrfs.h |    6 ++----
 4 files changed, 4 insertions(+), 9 deletions(-)

--- linux-next.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-05-20 05:08:25.000000000 +0800
+++ linux-next/fs/xfs/linux-2.6/xfs_aops.c	2011-05-20 05:12:47.000000000 +0800
@@ -970,7 +970,7 @@ xfs_vm_writepage(
 	offset = page_offset(page);
 	type = IO_OVERWRITE;
 
-	if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
+	if (wbc->sync_mode == WB_SYNC_NONE)
 		nonblocking = 1;
 
 	do {
--- linux-next.orig/include/trace/events/btrfs.h	2011-05-20 05:08:25.000000000 +0800
+++ linux-next/include/trace/events/btrfs.h	2011-05-20 05:12:47.000000000 +0800
@@ -284,7 +284,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__field(	long,   pages_skipped		)
 		__field(	loff_t, range_start		)
 		__field(	loff_t, range_end		)
-		__field(	char,   nonblocking		)
 		__field(	char,   for_kupdate		)
 		__field(	char,   for_reclaim		)
 		__field(	char,   range_cyclic		)
@@ -299,7 +298,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->range_start	= wbc->range_start;
 		__entry->range_end	= wbc->range_end;
-		__entry->nonblocking	= wbc->nonblocking;
 		__entry->for_kupdate	= wbc->for_kupdate;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
@@ -310,13 +308,13 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 
 	TP_printk("root = %llu(%s), ino = %lu, page_index = %lu, "
 		  "nr_to_write = %ld, pages_skipped = %ld, range_start = %llu, "
-		  "range_end = %llu, nonblocking = %d, for_kupdate = %d, "
+		  "range_end = %llu, for_kupdate = %d, "
 		  "for_reclaim = %d, range_cyclic = %d, writeback_index = %lu",
 		  show_root_type(__entry->root_objectid),
 		  (unsigned long)__entry->ino, __entry->index,
 		  __entry->nr_to_write, __entry->pages_skipped,
 		  __entry->range_start, __entry->range_end,
-		  __entry->nonblocking, __entry->for_kupdate,
+		  __entry->for_kupdate,
 		  __entry->for_reclaim, __entry->range_cyclic,
 		  (unsigned long)__entry->writeback_index)
 );
--- linux-next.orig/fs/nfs/write.c	2011-05-20 05:08:25.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-05-20 05:12:47.000000000 +0800
@@ -1562,8 +1562,7 @@ int nfs_write_inode(struct inode *inode,
 		int status;
 		bool sync = true;
 
-		if (wbc->sync_mode == WB_SYNC_NONE || wbc->nonblocking ||
-		    wbc->for_background)
+		if (wbc->sync_mode == WB_SYNC_NONE)
 			sync = false;
 
 		status = pnfs_layoutcommit_inode(inode, sync);
--- linux-next.orig/include/linux/writeback.h	2011-05-20 05:12:46.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-20 05:12:47.000000000 +0800
@@ -36,8 +36,6 @@ struct writeback_control {
 	loff_t range_start;
 	loff_t range_end;
 
-	unsigned nonblocking:1;		/* Don't get stuck on request queues */
-	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 16/18] writeback: trace event writeback_single_inode
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (14 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 15/18] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 17/18] writeback: trace event writeback_queue_io Wu Fengguang
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/plain, Size: 4034 bytes --]

It is valuable to know how the dirty inodes are iterated and what their
IO sizes are.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

v2: add trace event writeback_single_inode_requeue as proposed by Dave.

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    4 +
 include/trace/events/writeback.h |   70 +++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+)

--- linux-next.orig/include/trace/events/writeback.h	2011-05-20 05:35:47.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-20 05:41:03.000000000 +0800
@@ -8,6 +8,19 @@
 #include <linux/device.h>
 #include <linux/writeback.h>
 
+#define show_inode_state(state)					\
+	__print_flags(state, "|",				\
+		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
+		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
+		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"},	\
+		{I_NEW,			"I_NEW"},		\
+		{I_WILL_FREE,		"I_WILL_FREE"},		\
+		{I_FREEING,		"I_FREEING"},		\
+		{I_CLEAR,		"I_CLEAR"},		\
+		{I_SYNC,		"I_SYNC"},		\
+		{I_REFERENCED,		"I_REFERENCED"}		\
+	)
+
 struct wb_writeback_work;
 
 DECLARE_EVENT_CLASS(writeback_work_class,
@@ -201,6 +214,63 @@ DEFINE_EVENT(writeback_congest_waited_te
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
+DECLARE_EVENT_CLASS(writeback_single_inode_template,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write
+	),
+
+	TP_ARGS(inode, wbc, nr_to_write),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, age)
+		__field(unsigned long, writeback_index)
+		__field(long, nr_to_write)
+		__field(unsigned long, wrote)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->age		= (jiffies - inode->dirtied_when) *
+								1000 / HZ;
+		__entry->writeback_index = inode->i_mapping->writeback_index;
+		__entry->nr_to_write	= nr_to_write;
+		__entry->wrote		= nr_to_write - wbc->nr_to_write;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+		  "index=%lu to_write=%ld wrote=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->age,
+		  __entry->writeback_index,
+		  __entry->nr_to_write,
+		  __entry->wrote
+	)
+);
+
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_requeue,
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write),
+	TP_ARGS(inode, wbc, nr_to_write)
+);
+
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write),
+	TP_ARGS(inode, wbc, nr_to_write)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:35:47.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:41:12.000000000 +0800
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *ino
 		       struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -378,6 +379,8 @@ writeback_single_inode(struct inode *ino
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
 			requeue_io(inode, wb);
+			trace_writeback_single_inode_requeue(inode, wbc,
+							     nr_to_write);
 			return 0;
 		}
 
@@ -476,6 +479,7 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+	trace_writeback_single_inode(inode, wbc, nr_to_write);
 	return ret;
 }
 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 17/18] writeback: trace event writeback_queue_io
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (15 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 16/18] writeback: trace event writeback_single_inode Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-19 21:45 ` [PATCH 18/18] writeback: rearrange the wb_writeback() loop Wu Fengguang
  2011-05-23  9:07 ` [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Christoph Hellwig
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-queue_io.patch --]
[-- Type: text/plain, Size: 3472 bytes --]

Note that it adds a little overhead to account for the inodes
moved/enqueued from b_dirty to b_io. The "moved" count may later be used
to limit the number of inodes that can be moved in one shot, in order to
keep the spinlock hold time under control.
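
With the event enabled, each queue_io() call emits a line like the
following (values made up for illustration):

	writeback_queue_io: bdi 8:0: older=4295233932 age=3000 enqueue=24

where "older" is older_than_this in jiffies, "age" is the same point in
relative milliseconds, and "enqueue" is the number of inodes moved to
b_io.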

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   14 ++++++++++----
 include/trace/events/writeback.h |   25 +++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 4 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:12:48.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:12:49.000000000 +0800
@@ -258,15 +258,16 @@ static bool inode_dirtied_after(struct i
 /*
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
-static void move_expired_inodes(struct list_head *delaying_queue,
+static int move_expired_inodes(struct list_head *delaying_queue,
 			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+			       unsigned long *older_than_this)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
+	int moved = 0;
 
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
@@ -277,12 +278,13 @@ static void move_expired_inodes(struct l
 			do_sb_sort = 1;
 		sb = inode->i_sb;
 		list_move(&inode->i_wb_list, &tmp);
+		moved++;
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
 	if (!do_sb_sort) {
 		list_splice(&tmp, dispatch_queue);
-		return;
+		goto out;
 	}
 
 	/* Move inodes from one superblock together */
@@ -294,6 +296,8 @@ static void move_expired_inodes(struct l
 				list_move(&inode->i_wb_list, dispatch_queue);
 		}
 	}
+out:
+	return moved;
 }
 
 /*
@@ -309,9 +313,11 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
+	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	trace_writeback_queue_io(wb, older_than_this, moved);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
--- linux-next.orig/include/trace/events/writeback.h	2011-05-20 05:12:48.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-20 05:12:49.000000000 +0800
@@ -179,6 +179,31 @@ DEFINE_EVENT(wbc_class, name, \
 	TP_ARGS(wbc, bdi))
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_queue_io,
+	TP_PROTO(struct bdi_writeback *wb,
+		 unsigned long *older_than_this,
+		 int moved),
+	TP_ARGS(wb, older_than_this, moved),
+	TP_STRUCT__entry(
+		__array(char,		name, 32)
+		__field(unsigned long,	older)
+		__field(long,		age)
+		__field(int,		moved)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+		__entry->older	= older_than_this ?  *older_than_this : 0;
+		__entry->age	= older_than_this ?
+				  (jiffies - *older_than_this) * 1000 / HZ : -1;
+		__entry->moved	= moved;
+	),
+	TP_printk("bdi %s: older=%lu age=%ld enqueue=%d",
+		__entry->name,
+		__entry->older,	/* older_than_this in jiffies */
+		__entry->age,	/* older_than_this in relative milliseconds */
+		__entry->moved)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 18/18] writeback: rearrange the wb_writeback() loop
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (16 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 17/18] writeback: trace event writeback_queue_io Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  2011-05-23  9:07 ` [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Christoph Hellwig
  18 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: loop-cleanup --]
[-- Type: text/plain, Size: 2853 bytes --]

Move the termination conditions to the end of the loop, and move the
work->older_than_this updates close to each other.

Proposed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   58 +++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 27 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:41:35.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 05:41:36.000000000 +0800
@@ -712,42 +712,22 @@ static long wb_writeback(struct bdi_writ
 	unsigned long oldest_jif;
 	struct inode *inode;
 	long progress;
+	bool bg_retry_all = false;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
 
 	spin_lock(&wb->list_lock);
 	for (;;) {
-		/*
-		 * Stop writeback when nr_pages has been consumed
-		 */
-		if (work->nr_pages <= 0)
-			break;
-
-		/*
-		 * Background writeout and kupdate-style writeback may
-		 * run forever. Stop them if there is other work to do
-		 * so that e.g. sync can proceed. They'll be restarted
-		 * after the other works are all done.
-		 */
-		if ((work->for_background || work->for_kupdate) &&
-		    !list_empty(&wb->bdi->work_list))
-			break;
-
-		/*
-		 * For background writeout, stop when we are below the
-		 * background dirty threshold
-		 */
-		if (work->for_background && !over_bground_thresh())
-			break;
-
-		if (work->for_kupdate || work->for_background) {
+		if (bg_retry_all) {
+			bg_retry_all = false;
+			work->older_than_this = NULL;
+		} else if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
 			work->older_than_this = &oldest_jif;
 		}
 
-retry:
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
 			queue_io(wb, work->older_than_this);
@@ -777,14 +757,38 @@ retry:
 		    work->older_than_this &&
 		    list_empty(&wb->b_io) &&
 		    list_empty(&wb->b_more_io)) {
-			work->older_than_this = NULL;
-			goto retry;
+			bg_retry_all = true;
+			continue;
 		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
+
+		/*
+		 * Stop writeback when nr_pages has been consumed
+		 */
+		if (work->nr_pages <= 0)
+			break;
+
+		/*
+		 * For background writeout, stop when we are below the
+		 * background dirty threshold
+		 */
+		if (work->for_background && !over_bground_thresh())
+			break;
+
+		/*
+		 * Background writeout and kupdate-style writeback may
+		 * run forever. Stop them if there is other work to do
+		 * so that e.g. sync can proceed. They'll be restarted
+		 * after the other works are all done.
+		 */
+		if ((work->for_background || work->for_kupdate) &&
+		    !list_empty(&wb->bdi->work_list))
+			break;
+
 		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 14/18] writeback: make writeback_control.nr_to_write straight
  2011-05-19 21:45 ` [PATCH 14/18] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-05-19 22:06   ` Wu Fengguang
  2011-05-19 23:29     ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Wu Fengguang @ 2011-05-19 22:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML

:                 writeback_single_inode(inode, wb, &wbc);
:                 work->nr_pages -= write_chunk - wbc.nr_to_write;
:                 wrote += write_chunk - wbc.nr_to_write;
:                 if (wbc.pages_skipped) {
:                         /*
:                          * writeback is not making progress due to locked
:                          * buffers.  Skip this inode for now.
:                          */
:                         redirty_tail(inode, wb);
: -               }
: +               } else if (!(inode->i_state & I_DIRTY))
: +                       wrote++;

It looks a bit cleaner to do

:                 wrote += write_chunk - wbc.nr_to_write;
: +               if (!(inode->i_state & I_DIRTY))
: +                       wrote++;
:                 if (wbc.pages_skipped) {
:                         /*
:                          * writeback is not making progress due to locked
:                          * buffers.  Skip this inode for now.
:                          */
:                         redirty_tail(inode, wb);
:                 }

This is the updated patch.

---
Subject: writeback: make writeback_control.nr_to_write straight 
Date: Wed May 04 19:54:37 CST 2011

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it to write multiple files in
writeback_sb_inodes() and its callers.

This immediately cleans things up: suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes() it can always start from a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than with whatever
quota remains in wbc->nr_to_write.
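
To illustrate the chunk size change with made-up numbers (a userspace
sketch for this changelog, not code from the patch):

        #include <stdio.h>

        #define MAX_WRITEBACK_PAGES     1024L   /* 4MB in 4KB pages */

        int main(void)
        {
                /* quota shared by all inodes under the old scheme */
                long nr_to_write = MAX_WRITEBACK_PAGES;

                /* earlier small inodes already consumed 900 pages */
                nr_to_write -= 900;

                /* old: a large file is written with the leftover quota */
                printf("old chunk: %4ld pages\n", nr_to_write);

                /* new: writeback_chunk_size() grants a fresh full chunk */
                printf("new chunk: %4ld pages\n", MAX_WRITEBACK_PAGES);
                return 0;
        }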

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 
 fs/fs-writeback.c                |  199 ++++++++++++++++-------------
 include/linux/writeback.h        |    6 
 include/trace/events/writeback.h |   39 ++++-
 mm/backing-dev.c                 |    9 -
 mm/page-writeback.c              |   17 --
 6 files changed, 147 insertions(+), 125 deletions(-)

change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
  to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_cleaned
- remove wbc_writeback_* trace events for now

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 05:35:46.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 06:00:41.000000000 +0800
@@ -30,11 +30,21 @@
 #include "internal.h"
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024L
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -463,7 +473,6 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
-			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -496,6 +505,31 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+	long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_sb_inodes()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
+		pages = LONG_MAX;
+	else
+		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+	return pages;
+}
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -503,18 +537,30 @@ static bool pin_sb_for_writeback(struct 
  * inodes. Otherwise write only ones which go sequentially
  * in reverse order.
  *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes cleaned.
  */
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
-		struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
 {
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	unsigned long start_time = jiffies;
+	long write_chunk;
+	long wrote = 0;  /* count both pages and inodes */
+
 	while (!list_empty(&wb->b_io)) {
-		long pages_skipped;
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
 		if (inode->i_sb != sb) {
-			if (only_this_sb) {
+			if (work->sb) {
 				/*
 				 * We only want to write back data for this
 				 * superblock, move all inodes not belonging
@@ -529,7 +575,7 @@ static int writeback_sb_inodes(struct su
 			 * Bounce back to the caller to unpin this and
 			 * pin the next superblock.
 			 */
-			return 0;
+			break;
 		}
 
 		/*
@@ -543,12 +589,18 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode, wb);
 			continue;
 		}
-
 		__iget(inode);
+		write_chunk = writeback_chunk_size(work);
+		wbc.nr_to_write = write_chunk;
+		wbc.pages_skipped = 0;
+
+		writeback_single_inode(inode, wb, &wbc);
 
-		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wb, wbc);
-		if (wbc->pages_skipped != pages_skipped) {
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
+		if (!(inode->i_state & I_DIRTY))
+			wrote++;
+		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
@@ -560,17 +612,25 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0)
-			return 1;
+		/*
+		 * bail out to wb_writeback() often enough to check
+		 * background threshold and other termination conditions.
+		 */
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
-	/* b_io is empty */
-	return 1;
+	return wrote;
 }
 
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
-				  struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
-	int ret = 0;
+	unsigned long start_time = jiffies;
+	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -580,33 +640,36 @@ static void __writeback_inodes_wb(struct
 			requeue_io(inode, wb);
 			continue;
 		}
-		ret = writeback_sb_inodes(sb, wb, wbc, false);
+		wrote += writeback_sb_inodes(sb, wb, work);
 		drop_super(sb);
 
-		if (ret)
-			break;
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
 	/* Leave any unwritten inodes on b_io */
+	return wrote;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
 {
+	struct wb_writeback_work work = {
+		.nr_pages	= nr_pages,
+		.sync_mode	= WB_SYNC_NONE,
+		.range_cyclic	= 1,
+	};
+
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
-	__writeback_inodes_wb(wb, wbc);
+		queue_io(wb, NULL);
+	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
-}
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+	return nr_pages - work.nr_pages;
+}
 
 static inline bool over_bground_thresh(void)
 {
@@ -636,42 +699,13 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_writepages	= work->tagged_writepages,
-		.older_than_this	= NULL,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.range_cyclic		= work->range_cyclic,
-	};
+	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	long wrote = 0;
-	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
-
-	if (!wbc.range_cyclic) {
-		wbc.range_start = 0;
-		wbc.range_end = LLONG_MAX;
-	}
-
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          writeback_sb_inodes()       <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
-		write_chunk = LONG_MAX;
+	long progress;
 
 	oldest_jif = jiffies;
-	wbc.older_than_this = &oldest_jif;
+	work->older_than_this = &oldest_jif;
 
 	spin_lock(&wb->list_lock);
 	for (;;) {
@@ -701,25 +735,18 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			wbc.older_than_this = &oldest_jif;
+			work->older_than_this = &oldest_jif;
 		}
 
-		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
-		wbc.inodes_cleaned = 0;
-
 retry:
-		trace_wbc_writeback_start(&wbc, wb->bdi);
+		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, wbc.older_than_this);
+			queue_io(wb, work->older_than_this);
 		if (work->sb)
-			writeback_sb_inodes(work->sb, wb, &wbc, true);
+			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-			__writeback_inodes_wb(wb, &wbc);
-		trace_wbc_writeback_written(&wbc, wb->bdi);
-
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
+			progress = __writeback_inodes_wb(wb, work);
+		trace_writeback_written(wb->bdi, work);
 
 		/*
 		 * Did we write something? Try for more
@@ -729,9 +756,7 @@ retry:
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		if (wbc.inodes_cleaned)
+		if (progress)
 			continue;
 		/*
 		 * background writeback will start with expired inodes, and
@@ -740,10 +765,10 @@ retry:
 		 * lists and cause trouble to the page reclaim.
 		 */
 		if (work->for_background &&
-		    wbc.older_than_this &&
+		    work->older_than_this &&
 		    list_empty(&wb->b_io) &&
 		    list_empty(&wb->b_more_io)) {
-			wbc.older_than_this = NULL;
+			work->older_than_this = NULL;
 			goto retry;
 		}
 		/*
@@ -757,8 +782,8 @@ retry:
 		 * we'll just busyloop.
 		 */
 		if (!list_empty(&wb->b_more_io))  {
+			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
-			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
@@ -766,7 +791,7 @@ retry:
 	}
 	spin_unlock(&wb->list_lock);
 
-	return wrote;
+	return nr_pages - work->nr_pages;
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-05-20 05:35:46.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-20 06:00:11.000000000 +0800
@@ -24,12 +24,9 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
-	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
@@ -58,8 +55,7 @@ void writeback_inodes_sb_nr(struct super
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
 
--- linux-next.orig/mm/backing-dev.c	2011-05-20 05:35:44.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-05-20 05:35:47.000000000 +0800
@@ -262,14 +262,7 @@ int bdi_has_dirty_io(struct backing_dev_
 
 static void bdi_flush_io(struct backing_dev_info *bdi)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
-		.range_cyclic		= 1,
-		.nr_to_write		= 1024,
-	};
-
-	writeback_inodes_wb(&bdi->wb, &wbc);
+	writeback_inodes_wb(&bdi->wb, 1024);
 }
 
 /*
--- linux-next.orig/mm/page-writeback.c	2011-05-20 05:35:44.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-20 05:35:47.000000000 +0800
@@ -494,13 +494,6 @@ static void balance_dirty_pages(struct a
 		return;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -562,17 +555,17 @@ static void balance_dirty_pages(struct a
 		 * threshold otherwise wait until the disk writes catch
 		 * up.
 		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
+		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
+			pages_written += writeback_inodes_wb(&bdi->wb,
+							     write_chunk);
+			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		trace_balance_dirty_wait(bdi);
 
 		/*
 		 * Increase the delay for each loop, up to our previous
--- linux-next.orig/include/trace/events/writeback.h	2011-05-20 05:35:46.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-20 06:00:08.000000000 +0800
@@ -49,6 +49,9 @@ DEFINE_EVENT(writeback_work_class, name,
 DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
@@ -88,6 +91,30 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
+DEFINE_WRITEBACK_EVENT(balance_dirty_start);
+DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
+
+TRACE_EVENT(balance_dirty_written,
+
+	TP_PROTO(struct backing_dev_info *bdi, int written),
+
+	TP_ARGS(bdi, written),
+
+	TP_STRUCT__entry(
+		__array(char,	name, 32)
+		__field(int,	written)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->written = written;
+	),
+
+	TP_printk("bdi %s written %d",
+		  __entry->name,
+		  __entry->written
+	)
+);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
@@ -101,7 +128,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -115,14 +141,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -132,7 +156,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
@@ -141,12 +164,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 DEFINE_EVENT(wbc_class, name, \
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
 	TP_ARGS(wbc, bdi))
-DEFINE_WBC_EVENT(wbc_writeback_start);
-DEFINE_WBC_EVENT(wbc_writeback_written);
-DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
--- linux-next.orig/fs/btrfs/extent_io.c	2011-05-20 05:35:38.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-05-20 05:35:47.000000000 +0800
@@ -2596,7 +2596,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2629,7 +2628,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,


* Re: [PATCH 14/18] writeback: make writeback_control.nr_to_write straight
  2011-05-19 22:06   ` Wu Fengguang
@ 2011-05-19 23:29     ` Dave Chinner
  2011-05-20  4:07       ` Wu Fengguang
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2011-05-19 23:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, linux-fsdevel, LKML

On Fri, May 20, 2011 at 06:06:44AM +0800, Wu Fengguang wrote:
> :                 writeback_single_inode(inode, wb, &wbc);
> :                 work->nr_pages -= write_chunk - wbc.nr_to_write;
> :                 wrote += write_chunk - wbc.nr_to_write;
> :                 if (wbc.pages_skipped) {
> :                         /*
> :                          * writeback is not making progress due to locked
> :                          * buffers.  Skip this inode for now.
> :                          */
> :                         redirty_tail(inode, wb);
> : -               }
> : +               } else if (!(inode->i_state & I_DIRTY))
> : +                       wrote++;
> 
> It looks a bit cleaner to do
> 
> :                 wrote += write_chunk - wbc.nr_to_write;
> : +               if (!(inode->i_state & I_DIRTY))
> : +                       wrote++;
> :                 if (wbc.pages_skipped) {
> :                         /*
> :                          * writeback is not making progress due to locked
> :                          * buffers.  Skip this inode for now.
> :                          */
> :                         redirty_tail(inode, wb);
> :                 }

But it's still in the wrong place - such post-write inode dirty
processing is supposed to be isolated to writeback_single_inode().
Spreading it across multiple locations is not, IMO, the nicest thing
to do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/18] writeback: make writeback_control.nr_to_write straight
  2011-05-19 23:29     ` Dave Chinner
@ 2011-05-20  4:07       ` Wu Fengguang
  2011-05-20  6:52         ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Wu Fengguang @ 2011-05-20  4:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, linux-fsdevel, LKML

On Fri, May 20, 2011 at 07:29:10AM +0800, Dave Chinner wrote:
> On Fri, May 20, 2011 at 06:06:44AM +0800, Wu Fengguang wrote:
> > :                 writeback_single_inode(inode, wb, &wbc);
> > :                 work->nr_pages -= write_chunk - wbc.nr_to_write;
> > :                 wrote += write_chunk - wbc.nr_to_write;
> > :                 if (wbc.pages_skipped) {
> > :                         /*
> > :                          * writeback is not making progress due to locked
> > :                          * buffers.  Skip this inode for now.
> > :                          */
> > :                         redirty_tail(inode, wb);
> > : -               }
> > : +               } else if (!(inode->i_state & I_DIRTY))
> > : +                       wrote++;
> > 
> > It looks a bit more clean to do
> > 
> > :                 wrote += write_chunk - wbc.nr_to_write;
> > : +               if (!(inode->i_state & I_DIRTY))
> > : +                       wrote++;
> > :                 if (wbc.pages_skipped) {
> > :                         /*
> > :                          * writeback is not making progress due to locked
> > :                          * buffers.  Skip this inode for now.
> > :                          */
> > :                         redirty_tail(inode, wb);
> > :                 }
> 
> But it's still in the wrong place - such post-write inode dirty
> processing is supposed to be isolated to writeback_single_inode().
> Spreading it across multiple locations is not, IMO, the nicest thing
> to do...

Strictly speaking, it's post inspecting :)

It does look reasonable and safe to move the pages_skipped post
processing into writeback_single_inode(). See the below patch.

When doing this chunk,

-			if (wbc->nr_to_write <= 0) {
+			if (wbc->nr_to_write <= 0 && wbc->pages_skipped == 0) {

I wonder, in a general sense (without knowing enough FS internals),
whether ->pages_skipped is that useful: if some locked buffer is
blocking all subsequent pages, then ->nr_to_write won't be able to
drop to zero.  So the (wbc->pages_skipped == 0) test seems redundant.
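
Schematically, the accounting I'm assuming here (a standalone userspace
model, not the real write_cache_pages()):

        #include <stdio.h>

        struct writeback_control {
                long nr_to_write;
                long pages_skipped;
        };

        int main(void)
        {
                struct writeback_control wbc = { .nr_to_write = 16 };
                int first_locked = 4;   /* pages >= 4 hit the locked buffer */
                int i;

                for (i = 0; i < 32 && wbc.nr_to_write > 0; i++) {
                        if (i >= first_locked) {
                                wbc.pages_skipped++;    /* quota untouched */
                                continue;
                        }
                        wbc.nr_to_write--;              /* quota consumed */
                }
                /* prints nr_to_write=12 pages_skipped=28: never hits 0 */
                printf("nr_to_write=%ld pages_skipped=%ld\n",
                       wbc.nr_to_write, wbc.pages_skipped);
                return 0;
        }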

Thanks,
Fengguang
---
Subject: writeback: move pages_skipped post processing into writeback_single_inode()
Date: Fri May 20 11:42:42 CST 2011

It's more logical to isolate post-write processing in writeback_single_inode().

Note that it slightly changes behavior for write_inode_now() and sync_inode(),
which used to ignore pages_skipped.

Proposed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 11:26:19.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 11:42:30.000000000 +0800
@@ -404,6 +404,7 @@ writeback_single_inode(struct inode *ino
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 
+	wbc->pages_skipped = 0;
 	ret = do_writepages(mapping, wbc);
 
 	/*
@@ -443,7 +444,7 @@ writeback_single_inode(struct inode *ino
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
+			if (wbc->nr_to_write <= 0 && wbc->pages_skipped == 0) {
 				/*
 				 * slice used up: queue for next turn
 				 */
@@ -602,7 +603,6 @@ static long writeback_sb_inodes(struct s
 		__iget(inode);
 		write_chunk = writeback_chunk_size(work);
 		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
 
 		writeback_single_inode(inode, wb, &wbc);
 
@@ -610,13 +610,6 @@ static long writeback_sb_inodes(struct s
 		wrote += write_chunk - wbc.nr_to_write;
 		if (!(inode->i_state & I_DIRTY))
 			wrote++;
-		if (wbc.pages_skipped) {
-			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
-			 */
-			redirty_tail(inode, wb);
-		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
 		iput(inode);


* Re: [PATCH 14/18] writeback: make writeback_control.nr_to_write straight
  2011-05-20  4:07       ` Wu Fengguang
@ 2011-05-20  6:52         ` Dave Chinner
  2011-05-20  7:15           ` Wu Fengguang
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2011-05-20  6:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, linux-fsdevel, LKML

On Fri, May 20, 2011 at 12:07:40PM +0800, Wu Fengguang wrote:
> On Fri, May 20, 2011 at 07:29:10AM +0800, Dave Chinner wrote:
> > On Fri, May 20, 2011 at 06:06:44AM +0800, Wu Fengguang wrote:
> > > :                 writeback_single_inode(inode, wb, &wbc);
> > > :                 work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > :                 wrote += write_chunk - wbc.nr_to_write;
> > > :                 if (wbc.pages_skipped) {
> > > :                         /*
> > > :                          * writeback is not making progress due to locked
> > > :                          * buffers.  Skip this inode for now.
> > > :                          */
> > > :                         redirty_tail(inode, wb);
> > > : -               }
> > > : +               } else if (!(inode->i_state & I_DIRTY))
> > > : +                       wrote++;
> > > 
> > > It looks a bit cleaner to do
> > > 
> > > :                 wrote += write_chunk - wbc.nr_to_write;
> > > : +               if (!(inode->i_state & I_DIRTY))
> > > : +                       wrote++;
> > > :                 if (wbc.pages_skipped) {
> > > :                         /*
> > > :                          * writeback is not making progress due to locked
> > > :                          * buffers.  Skip this inode for now.
> > > :                          */
> > > :                         redirty_tail(inode, wb);
> > > :                 }
> > 
> > But it's still in the wrong place - such post-write inode dirty
> > processing is supposed to be isolated to writeback_single_inode().
> > Spreading it across multiple locations is not, IMO, the nicest thing
> > to do...
> 
> Strictly speaking, it's post inspecting :)
> 
> It does look reasonable and safe to move the pages_skipped post
> processing into writeback_single_inode(). See the below patch.

<sigh>

That's not what I was referring to. The wbc.pages_skipped check is
fine where it is.

> 
> When doing this chunk,
> 
> -			if (wbc->nr_to_write <= 0) {
> +			if (wbc->nr_to_write <= 0 && wbc->pages_skipped == 0) {
> 
> I wonder, in a general sense (without knowing enough FS internals),
> whether ->pages_skipped is that useful: if some locked buffer is
> blocking all subsequent pages, then ->nr_to_write won't be able to
> drop to zero.  So the (wbc->pages_skipped == 0) test seems redundant.
> 
> Thanks,
> Fengguang
> ---
> Subject: writeback: move pages_skipped post processing into writeback_single_inode()
> Date: Fri May 20 11:42:42 CST 2011
> 
> It's more logical to isolate post-write processing in writeback_single_inode().
> 
> Note that it slightly changes behavior for write_inode_now() and sync_inode(),
> which used to ignore pages_skipped.
> 
> Proposed-by: Dave Chinner <david@fromorbit.com>

No, I didn't propose the change you've made in this patch. I've been
asking you to fix the original patch, not proposing new changes to
some other code.  Please don't add my name to random tags in patches
without asking me first.

So, for the third time, please fix the original patch by moving the
new "inode now clean" accounting to the "inode-now-clean" logic
branch in writeback_single_inode().

        if (!(inode->i_state & I_FREEING)) {
                if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
.....
                } else if (inode->i_state & I_DIRTY) {
.....
                } else {
			/*
			 * account for it here with all the other
			 * inode-now-clean manipulations that we need
			 * to do!
			 */
.....
                }
        }

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/18] writeback: make writeback_control.nr_to_write straight
  2011-05-20  6:52         ` Dave Chinner
@ 2011-05-20  7:15           ` Wu Fengguang
  2011-05-20  7:26             ` Wu Fengguang
  0 siblings, 1 reply; 33+ messages in thread
From: Wu Fengguang @ 2011-05-20  7:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, linux-fsdevel, LKML

On Fri, May 20, 2011 at 02:52:07PM +0800, Dave Chinner wrote:
> On Fri, May 20, 2011 at 12:07:40PM +0800, Wu Fengguang wrote:
> > On Fri, May 20, 2011 at 07:29:10AM +0800, Dave Chinner wrote:
> > > On Fri, May 20, 2011 at 06:06:44AM +0800, Wu Fengguang wrote:
> > > > :                 writeback_single_inode(inode, wb, &wbc);
> > > > :                 work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > > :                 wrote += write_chunk - wbc.nr_to_write;
> > > > :                 if (wbc.pages_skipped) {
> > > > :                         /*
> > > > :                          * writeback is not making progress due to locked
> > > > :                          * buffers.  Skip this inode for now.
> > > > :                          */
> > > > :                         redirty_tail(inode, wb);
> > > > : -               }
> > > > : +               } else if (!(inode->i_state & I_DIRTY))
> > > > : +                       wrote++;
> > > > 
> > > > It looks a bit cleaner to do
> > > > 
> > > > :                 wrote += write_chunk - wbc.nr_to_write;
> > > > : +               if (!(inode->i_state & I_DIRTY))
> > > > : +                       wrote++;
> > > > :                 if (wbc.pages_skipped) {
> > > > :                         /*
> > > > :                          * writeback is not making progress due to locked
> > > > :                          * buffers.  Skip this inode for now.
> > > > :                          */
> > > > :                         redirty_tail(inode, wb);
> > > > :                 }
> > > 
> > > But it's still in the wrong place - such post-write inode dirty
> > > processing is supposed to be isolated to writeback_single_inode().
> > > Spreading it across multiple locations is not, IMO, the nicest thing
> > > to do...
> > 
> > Strictly speaking, it's post inspecting :)
> > 
> > It does look reasonable and safe to move the pages_skipped post
> > processing into writeback_single_inode(). See the below patch.
> 
> <sigh>
> 
> That's not what I was referring to. The wbc.pages_skipped check is
> fine where it is.
> 
> > 
> > When doing this chunk,
> > 
> > -			if (wbc->nr_to_write <= 0) {
> > +			if (wbc->nr_to_write <= 0 && wbc->pages_skipped == 0) {
> > 
> > I wonder, in a general sense (without knowing enough FS internals),
> > whether ->pages_skipped is that useful: if some locked buffer is
> > blocking all subsequent pages, then ->nr_to_write won't be able to
> > drop to zero.  So the (wbc->pages_skipped == 0) test seems redundant.
> > 
> > Thanks,
> > Fengguang
> > ---
> > Subject: writeback: move pages_skipped post processing into writeback_single_inode()
> > Date: Fri May 20 11:42:42 CST 2011
> > 
> > It's more logical to isolate post-write processing in writeback_single_inode().
> > 
> > Note that it slightly changes behavior for write_inode_now() and sync_inode(),
> > which used to ignore pages_skipped.
> > 
> > Proposed-by: Dave Chinner <david@fromorbit.com>
> 
> No, I didn't propose the change you've made in this patch. I've been
> asking you to fix the original patch, not proposing new changes to
> some other code.  Please don't add my name to random tags in patches
> without asking me first.

OK, sorry, I'll keep that in mind in future.

> So, for the third time, please fix the original patch by moving the
> new "inode now clean" accounting to the "inode-now-clean" logic
> branch in writeback_single_inode().
> 
>         if (!(inode->i_state & I_FREEING)) {
>                 if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> .....
>                 } else if (inode->i_state & I_DIRTY) {
> .....
>                 } else {
> 			/*
> 			 * account for it here with all the other
> 			 * inode-now-clean manipulations that we need
> 			 * to do!
> 			 */

That's what the original "writeback: introduce
writeback_control.inodes_cleaned" does. Given that adding
writeback_control.inodes_cleaned was opposed, the only remaining
option is to add one more argument, "long *inode_cleaned", to
writeback_single_inode(), like this.

Well, it looks ugly, and I wonder if you have a prettier version in
mind. This ugliness is the main reason I resisted making the change.

Thanks,
Fengguang
---

--- linux-next.orig/fs/fs-writeback.c	2011-05-20 15:09:11.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-20 15:09:15.000000000 +0800
@@ -359,7 +359,7 @@ static void inode_wait_for_writeback(str
  */
 static int
 writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
-		       struct writeback_control *wbc)
+		       struct writeback_control *wbc, long *inode_cleaned)
 {
 	struct address_space *mapping = inode->i_mapping;
 	long nr_to_write = wbc->nr_to_write;
@@ -482,6 +482,7 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
+			(*inode_cleaned)++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -604,12 +605,10 @@ static long writeback_sb_inodes(struct s
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
-		writeback_single_inode(inode, wb, &wbc);
+		writeback_single_inode(inode, wb, &wbc, &wrote);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
-		if (!(inode->i_state & I_DIRTY))
-			wrote++;
 		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
@@ -1352,6 +1351,7 @@ int write_inode_now(struct inode *inode,
 		.range_start = 0,
 		.range_end = LLONG_MAX,
 	};
+	long unused;
 
 	if (!mapping_cap_writeback_dirty(inode->i_mapping))
 		wbc.nr_to_write = 0;
@@ -1359,7 +1359,7 @@ int write_inode_now(struct inode *inode,
 	might_sleep();
 	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wb, &wbc);
+	ret = writeback_single_inode(inode, wb, &wbc, &unused);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 	if (sync)
@@ -1383,10 +1383,11 @@ int sync_inode(struct inode *inode, stru
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
+	long unused;
 
 	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wb, wbc);
+	ret = writeback_single_inode(inode, wb, wbc, &unused);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 	return ret;


* Re: [PATCH 14/18] writeback: make writeback_control.nr_to_write straight
  2011-05-20  7:15           ` Wu Fengguang
@ 2011-05-20  7:26             ` Wu Fengguang
  0 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-05-20  7:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, linux-fsdevel, LKML

On Fri, May 20, 2011 at 03:15:18PM +0800, Wu Fengguang wrote:
> On Fri, May 20, 2011 at 02:52:07PM +0800, Dave Chinner wrote:
> > On Fri, May 20, 2011 at 12:07:40PM +0800, Wu Fengguang wrote:
> > > On Fri, May 20, 2011 at 07:29:10AM +0800, Dave Chinner wrote:
> > > > On Fri, May 20, 2011 at 06:06:44AM +0800, Wu Fengguang wrote:
> > > > > :                 writeback_single_inode(inode, wb, &wbc);
> > > > > :                 work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > > > :                 wrote += write_chunk - wbc.nr_to_write;
> > > > > :                 if (wbc.pages_skipped) {
> > > > > :                         /*
> > > > > :                          * writeback is not making progress due to locked
> > > > > :                          * buffers.  Skip this inode for now.
> > > > > :                          */
> > > > > :                         redirty_tail(inode, wb);
> > > > > : -               }
> > > > > : +               } else if (!(inode->i_state & I_DIRTY))
> > > > > : +                       wrote++;
> > > > > 
> > > > > It looks a bit cleaner to do
> > > > > 
> > > > > :                 wrote += write_chunk - wbc.nr_to_write;
> > > > > : +               if (!(inode->i_state & I_DIRTY))
> > > > > : +                       wrote++;
> > > > > :                 if (wbc.pages_skipped) {
> > > > > :                         /*
> > > > > :                          * writeback is not making progress due to locked
> > > > > :                          * buffers.  Skip this inode for now.
> > > > > :                          */
> > > > > :                         redirty_tail(inode, wb);
> > > > > :                 }
> > > > 
> > > > But it's still in the wrong place - such post-write inode dirty
> > > > processing is supposed to be isolated to writeback_single_inode().
> > > > Spreading it across multiple locations is not, IMO, the nicest thing
> > > > to do...
> > > 
> > > Strictly speaking, it's post inspecting :)
> > > 
> > > It does look reasonable and safe to move the pages_skipped post
> > > processing into writeback_single_inode(). See the below patch.
> > 
> > <sigh>
> > 
> > That's not what I was referring to. The wbc.pages_skipped check is
> > fine where it is.
> > 
> > > 
> > > When doing this chunk,
> > > 
> > > -			if (wbc->nr_to_write <= 0) {
> > > +			if (wbc->nr_to_write <= 0 && wbc->pages_skipped == 0) {
> > > 
> > > I wonder, in a general sense (without knowing enough FS internals),
> > > whether ->pages_skipped is that useful: if some locked buffer is
> > > blocking all subsequent pages, then ->nr_to_write won't be able to
> > > drop to zero.  So the (wbc->pages_skipped == 0) test seems redundant.
> > > 
> > > Thanks,
> > > Fengguang
> > > ---
> > > Subject: writeback: move pages_skipped post processing into writeback_single_inode()
> > > Date: Fri May 20 11:42:42 CST 2011
> > > 
> > > It's more logical to isolate post-write processing in writeback_single_inode().
> > > 
> > > Note that it slightly changes behavior for write_inode_now() and sync_inode(),
> > > which used to ignore pages_skipped.
> > > 
> > > Proposed-by: Dave Chinner <david@fromorbit.com>
> > 
> > No, I didn't propose the change you've made in this patch. I've been
> > asking you to fix the original patch, not proposing new changes to
> > some other code.  Please don't add my name to random tags in patches
> > without asking me first.
> 
> OK, sorry, I'll keep that in mind in future.
> 
> > So, for the third time, please fix the original patch by moving the
> > new "inode now clean" accounting to the "inode-now-clean" logic
> > branch in writeback_single_inode().
> > 
> >         if (!(inode->i_state & I_FREEING)) {
> >                 if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > .....
> >                 } else if (inode->i_state & I_DIRTY) {
> > .....
> >                 } else {
> > 			/*
> > 			 * account for it here with all the other
> > 			 * inode-now-clean manipulations that we need
> > 			 * to do!
> > 			 */
> 
> That's what the original "writeback: introduce
> writeback_control.inodes_cleaned" does. Given that adding
> writeback_control.inodes_cleaned was opposed, the only remaining
> option is to add one more argument, "long *inode_cleaned", to
> writeback_single_inode(), like this.
>
> Well, it looks ugly, and I wonder if you have a prettier version in
> mind. This ugliness is the main reason I resisted making the change.

The other option is to make wbc->inode_cleaned a *bit* field.
It still adds one more writeback_control field *logically* and several
new lines of code, but saves a bit of stack overhead.
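
Something like this untested sketch (a userspace model of the idea,
just to show the shape):

        #include <stdio.h>

        struct writeback_control {
                long nr_to_write;
                unsigned int inode_cleaned:1;   /* inode went clean */
        };

        /* model of writeback_single_inode() reporting via the bit */
        static void writeback_single_inode(long dirty_pages,
                                           struct writeback_control *wbc)
        {
                long n = dirty_pages < wbc->nr_to_write ?
                         dirty_pages : wbc->nr_to_write;

                wbc->nr_to_write -= n;
                wbc->inode_cleaned = (n == dirty_pages);
        }

        int main(void)
        {
                struct writeback_control wbc = { .nr_to_write = 1024 };
                long wrote = 0;

                writeback_single_inode(8, &wbc);
                wrote += 1024 - wbc.nr_to_write;
                if (wbc.inode_cleaned)  /* count the cleaned inode too */
                        wrote++;
                printf("wrote=%ld\n", wrote);   /* 9: 8 pages + 1 inode */
                return 0;
        }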

Thanks,
Fengguang


* Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
                   ` (17 preceding siblings ...)
  2011-05-19 21:45 ` [PATCH 18/18] writeback: rearrange the wb_writeback() loop Wu Fengguang
@ 2011-05-23  9:07 ` Christoph Hellwig
  2011-05-23  9:28   ` Wu Fengguang
  18 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2011-05-23  9:07 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, Jan Kara, Dave Chinner, linux-fsdevel, LKML

Wu, Andrew,

what's the plan for these for 2.6.40?  We'll need to make some progress
in this area, and even if we can't get everything in, we should make sure
to at least include the updated versions of those in -mm.  But even
some of the later ones are pretty low risk.



* Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
  2011-05-23  9:07 ` [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Christoph Hellwig
@ 2011-05-23  9:28   ` Wu Fengguang
  2011-05-24  3:28     ` Wu Fengguang
  0 siblings, 1 reply; 33+ messages in thread
From: Wu Fengguang @ 2011-05-23  9:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jan Kara, Dave Chinner, linux-fsdevel, LKML

On Mon, May 23, 2011 at 05:07:39PM +0800, Christoph Hellwig wrote:
> Wu, Andrew,
> 
> what's the plan for these for 2.6.40?  We'll need to make some progress
> in this area, and even if we can't get everything in, we should make sure
> to at least include the updated versions of those in -mm.  But even
> some of the later ones are pretty low risk.

Yes. Except for patch 14, which introduces no external behavior
changes besides the welcome full write chunk size for large files, the
patches not in -mm are pretty trivial ones.

Aside from my simple tests, Alex also helped run the patchset through
the LKP tests and found no writeback regressions.

Thanks,
Fengguang


* Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
  2011-05-23  9:28   ` Wu Fengguang
@ 2011-05-24  3:28     ` Wu Fengguang
  2011-06-01 22:31       ` Andrew Morton
  0 siblings, 1 reply; 33+ messages in thread
From: Wu Fengguang @ 2011-05-24  3:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jan Kara, Dave Chinner, linux-fsdevel, LKML

On Mon, May 23, 2011 at 05:28:38PM +0800, Wu Fengguang wrote:
> On Mon, May 23, 2011 at 05:07:39PM +0800, Christoph Hellwig wrote:
> > Wu, Andrew,
> > 
> > what's the plan for these for 2.6.40?  We'll need to make some progress
> > in this area, and even if we can't get everything in, we should make sure
> > to at least include the updated versions of those in -mm.  But even
> > some of the later ones are pretty low risk.
> 
> Yes. Except for patch 14, which introduces no external behavior
> changes besides the welcome full write chunk size for large files, the
> patches not in -mm are pretty trivial ones.
> 
> Aside from my simple tests, Alex also helped run the patchset through
> the LKP tests and found no writeback regressions.

I'll rearrange the series and move patch 14 to the end, so that the
patches come in order

- 10 (updated) patches in -mm
- 6 more trivial patches that are safe to go upstream after -rc1
- the current patch 14 and patch 16 (that depends on 14)

Thanks,
Fengguang


* Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
  2011-05-24  3:28     ` Wu Fengguang
@ 2011-06-01 22:31       ` Andrew Morton
  2011-06-02  2:29         ` Wu Fengguang
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2011-06-01 22:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-fsdevel, LKML

On Tue, 24 May 2011 11:28:59 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Mon, May 23, 2011 at 05:28:38PM +0800, Wu Fengguang wrote:
> > On Mon, May 23, 2011 at 05:07:39PM +0800, Christoph Hellwig wrote:
> > > Wu, Andrew,
> > > 
> > > what's the plan for these for 2.6.40?  We'll need to make some progress
> > > in this area, and even if we can't get everything in, we should make sure
> > > to at least include the updated versions of those in -mm.  But even
> > > some of the later ones are pretty low risk.

I've been waiting for this work to stabilize a bit more.

> > Yes. Except for patch 14, which introduces no external behavior
> > changes besides the welcome full write chunk size for large files, the
> > patches not in -mm are pretty trivial ones.
> > 
> > Aside from my simple tests, Alex also helped run the patchset through
> > the LKP tests and found no writeback regressions.
> 
> I'll rearrange the series and move patch 14 to the end, so that the
> patches come in order
> 
> - 10 (updated) patches in -mm
> - 6 more trivial patches that are safe to go upstream after -rc1
> - the current patch 14 and patch 16 (that depends on 14)

I didn't merge the writeback patches for -rc1 due to uncertainty about
their readiness.  Here's what I'm sitting on:


writeback-pass-writeback_control-down-to-move_expired_inodes.patch
writeback-introduce-writeback_controlinodes_cleaned.patch
writeback-try-more-writeback-as-long-as-something-was-written.patch
writeback-the-kupdate-expire-timestamp-should-be-a-moving-target.patch
writeback-sync-expired-inodes-first-in-background-writeback.patch
writeback-refill-b_io-iff-empty.patch
writeback-split-inode_wb_list_lock-into-bdi_writebacklist_lock.patch
writeback-elevate-queue_io-into-wb_writeback.patch
writeback-introduce-wbctagged_sync-for-the-wb_sync_none-sync-stage.patch
writeback-update-dirtied_when-for-synced-inode-to-prevent-livelock.patch
writeback-avoid-extra-sync-work-at-enqueue-time.patch

Are these all up to date and considered ready to go?


I have a couple of notes I made:

#writeback-try-more-writeback-as-long-as-something-was-written.patch: sync livelocks

IOW, someone reported livelocks with sync, or speculated that it might
cause them.

#writeback-sync-expired-inodes-first-in-background-writeback.patch: TBU??

"to be updated".

I haven't gone back through the email trail to work out why I added
these notes and whether the identified issues were resolved.



* Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
  2011-06-01 22:31       ` Andrew Morton
@ 2011-06-02  2:29         ` Wu Fengguang
  2011-06-07 12:13           ` writeback merge status, was " Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Wu Fengguang @ 2011-06-02  2:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-fsdevel, LKML

Andrew,

On Thu, Jun 02, 2011 at 06:31:42AM +0800, Andrew Morton wrote:
> On Tue, 24 May 2011 11:28:59 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Mon, May 23, 2011 at 05:28:38PM +0800, Wu Fengguang wrote:
> > > On Mon, May 23, 2011 at 05:07:39PM +0800, Christoph Hellwig wrote:
> > > > Wu, Andrew,
> > > > 
> > > > what's the plan for these for 2.6.40?  We'll need to make some progress
> > > > in this area, and even if we can't get everything in, we should make sure
> > > > to at least include the updated versions of those in -mm.  But even
> > > > some of the later ones are pretty low risk.
> 
> I've been waiting for this work to stabilize a bit more.

Yeah, there were many careful reviews going on thanks to Jan and Dave.

> > > Yes. Except for patch 14, which introduces no external behavior
> > > changes besides the welcome full write chunk size for large files, the
> > > patches not in -mm are pretty trivial ones.
> > > 
> > > Aside from my simple tests, Alex also helped run the patchset through
> > > the LKP tests and found no writeback regressions.
> > 
> > I'll rearrange the series and move patch 14 to the end, so that the
> > patches come in order
> > 
> > - 10 (updated) patches in -mm
> > - 6 more trivial patches that are safe to go upstream after -rc1
> > - the current patch 14 and patch 16 (that depends on 14)
> 
> I didn't merge the writeback patches for -rc1 due to uncertainty about
> their readiness.  Here's what I'm sitting on:
> 
> 
> writeback-pass-writeback_control-down-to-move_expired_inodes.patch
> writeback-introduce-writeback_controlinodes_cleaned.patch
> writeback-try-more-writeback-as-long-as-something-was-written.patch
> writeback-the-kupdate-expire-timestamp-should-be-a-moving-target.patch
> writeback-sync-expired-inodes-first-in-background-writeback.patch
> writeback-refill-b_io-iff-empty.patch
> writeback-split-inode_wb_list_lock-into-bdi_writebacklist_lock.patch
> writeback-elevate-queue_io-into-wb_writeback.patch
> writeback-introduce-wbctagged_sync-for-the-wb_sync_none-sync-stage.patch
> writeback-update-dirtied_when-for-synced-inode-to-prevent-livelock.patch
> writeback-avoid-extra-sync-work-at-enqueue-time.patch
> 
> Are these all up to date and considered ready to go?

The v4 patches are the most up-to-date ones. Since the whole patchset
has been re-organized, please drop the -mm patches and take all v4
patches except 06, 17 and 18:

[PATCH 06/18] writeback: sync expired inodes first in background writeback
[PATCH 17/18] writeback: make writeback_control.nr_to_write straight
[PATCH 18/18] writeback: rearrange the wb_writeback() loop

IMHO the other patches are ready to go. The recent updates are mostly
code refactoring. As for the behavior changes, no writeback regressions
show up in the LKP tests, and the dd+tar tests show a consistent
speedup for ext4 and xfs.
 
> I have a couple of notes I made:
> 
> #writeback-try-more-writeback-as-long-as-something-was-written.patch: sync livelocks
> 
> IOW, someone reported livelocks with sync, or speculated that it might
> cause them.

It was reported by me and then fixed by patches 01 and 02:

[PATCH 01/18] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
[PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock

> #writeback-sync-expired-inodes-first-in-background-writeback.patch: TBU??
> 
> "to be updated".

I've recommended dropping this patch. It cannot be trivially updated
without incurring either space or CPU overhead, and will at least need
more discussion, with some numbers collected on the exact CPU overhead.

Thanks,
Fengguang


* writeback merge status, was Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
  2011-06-02  2:29         ` Wu Fengguang
@ 2011-06-07 12:13           ` Christoph Hellwig
  2011-06-07 20:15             ` Andrew Morton
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2011-06-07 12:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Dave Chinner, linux-fsdevel, LKML, torvalds

It looks like we once again got no writeback fixes merged for $NEXT,
even with Wu having a very nice, well reviewed and tested series of 
fairly trivial patches.

I'm getting really frustrated about this.  We need a lot of work to
make writeback not suck with today's hardware and workloads, never mind
tomorrow's, and we're getting stalled for completely opaque reasons.
Basically, the people doing the work move ahead, then everything gets
stuck in -mm, which doesn't even get into linux-next, and stalls
completely there.  So my vote is to have a proper peer-reviewed
writeback git tree, handled like all the other git trees we have,
maintained e.g. by Wu, and to make progress with that.

After fixing all those basic issues we have other big items like
I/O-less balance_dirty_pages and the memcg integration on our plate,
and with the current rate of progress we won't get there before Linux
4.0.



* Re: writeback merge status, was Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
  2011-06-07 12:13           ` writeback merge status, was " Christoph Hellwig
@ 2011-06-07 20:15             ` Andrew Morton
  2011-06-07 21:11               ` Wu Fengguang
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2011-06-07 20:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, Jan Kara, Dave Chinner, linux-fsdevel, LKML, torvalds

On Tue, 7 Jun 2011 08:13:36 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> It looks like we once again got no writeback fixes merged for $NEXT,
> even with Wu having a very nice, well reviewed and tested series of 
> fairly trivial patches.

We're waiting for the v4 series:

: I'll rearrange the series and move patch 14 to the end, so that the
: patches come in order
:
: - 10 (updated) patches in -mm
: - 6 more trivial patches that are safe to go upstream after -rc1
: - the current patch 14 and patch 16 (that depends on 14)


* Re: writeback merge status, was Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3)
  2011-06-07 20:15             ` Andrew Morton
@ 2011-06-07 21:11               ` Wu Fengguang
  0 siblings, 0 replies; 33+ messages in thread
From: Wu Fengguang @ 2011-06-07 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Jan Kara, Dave Chinner, linux-fsdevel, LKML, torvalds

On Wed, Jun 08, 2011 at 04:15:12AM +0800, Andrew Morton wrote:
> On Tue, 7 Jun 2011 08:13:36 -0400
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > It looks like we once again got no writeback fixes merged for $NEXT,
> > even with Wu having a very nice, well reviewed and tested series of 
> > fairly trivial patches.
> 
> We're waiting for the v4 series:
 
The v4 series was posted here:

https://lkml.org/lkml/2011/5/24/20

> : I'll rearrange the series and move patch 14 to the end, so that the
> : patches come in order
> :
> : - 10 (updated) patches in -mm
> : - 6 more trivial patches that are safe to go upstream after -rc1
> : - the current patch 14 and patch 16 (that depends on 14)

Anyway, I'll make the series straight and resubmit.

Thanks,
Fengguang


end of thread

Thread overview: 33+ messages
2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
2011-05-19 21:45 ` [PATCH 01/18] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage Wu Fengguang
2011-05-19 21:45 ` [PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
2011-05-19 21:45 ` [PATCH 03/18] writeback: introduce writeback_control.inodes_cleaned Wu Fengguang
2011-05-19 21:45 ` [PATCH 04/18] writeback: try more writeback as long as something was written Wu Fengguang
2011-05-19 21:45 ` [PATCH 05/18] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
2011-05-19 21:45 ` [PATCH 06/18] writeback: sync expired inodes first in background writeback Wu Fengguang
2011-05-19 21:45 ` [PATCH 07/18] writeback: refill b_io iff empty Wu Fengguang
2011-05-19 21:45 ` [PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock Wu Fengguang
2011-05-19 21:45 ` [PATCH 09/18] writeback: elevate queue_io() into wb_writeback() Wu Fengguang
2011-05-19 21:45 ` [PATCH 10/18] writeback: avoid extra sync work at enqueue time Wu Fengguang
2011-05-19 21:45 ` [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
2011-05-19 21:45 ` [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
2011-05-19 21:45 ` [PATCH 13/18] writeback: remove writeback_control.more_io Wu Fengguang
2011-05-19 21:45 ` [PATCH 14/18] writeback: make writeback_control.nr_to_write straight Wu Fengguang
2011-05-19 22:06   ` Wu Fengguang
2011-05-19 23:29     ` Dave Chinner
2011-05-20  4:07       ` Wu Fengguang
2011-05-20  6:52         ` Dave Chinner
2011-05-20  7:15           ` Wu Fengguang
2011-05-20  7:26             ` Wu Fengguang
2011-05-19 21:45 ` [PATCH 15/18] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
2011-05-19 21:45 ` [PATCH 16/18] writeback: trace event writeback_single_inode Wu Fengguang
2011-05-19 21:45 ` [PATCH 17/18] writeback: trace event writeback_queue_io Wu Fengguang
2011-05-19 21:45 ` [PATCH 18/18] writeback: rearrange the wb_writeback() loop Wu Fengguang
2011-05-23  9:07 ` [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Christoph Hellwig
2011-05-23  9:28   ` Wu Fengguang
2011-05-24  3:28     ` Wu Fengguang
2011-06-01 22:31       ` Andrew Morton
2011-06-02  2:29         ` Wu Fengguang
2011-06-07 12:13           ` writeback merge status, was " Christoph Hellwig
2011-06-07 20:15             ` Andrew Morton
2011-06-07 21:11               ` Wu Fengguang
