* [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4)
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, linux-fsdevel,
	Wu Fengguang, LKML


Andrew,

The first 10 patches are already in the -mm tree, with the following updates:

- remove "writeback: pass writeback_control down to move_expired_inodes()" and
  resolve the resulting merge conflicts in the other patches
- move the sync livelock prevention patches (01, 02) ahead so that (04) won't livelock sync
- merge the three -mm fixes into (08)
- fix the changelog of (01)
- rename .for_sync to .tagged_writepages

	[PATCH 01/18] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
	[PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock
	[PATCH 03/18] writeback: introduce writeback_control.inodes_cleaned
	[PATCH 04/18] writeback: try more writeback as long as something was written
	[PATCH 05/18] writeback: the kupdate expire timestamp should be a moving target
	[PATCH 06/18] writeback: sync expired inodes first in background writeback
	[PATCH 07/18] writeback: refill b_io iff empty
	[PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
	[PATCH 09/18] writeback: elevate queue_io() into wb_writeback()
	[PATCH 10/18] writeback: avoid extra sync work at enqueue time

The following 6 patches are trivial and safe:

	[PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc
	[PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs
	[PATCH 13/18] writeback: remove writeback_control.more_io
	[PATCH 14/18] writeback: remove .nonblocking and .encountered_congestion
	[PATCH 15/18] writeback: trace event writeback_single_inode
	[PATCH 16/18] writeback: trace event writeback_queue_io

The last two patches are a bit more complex:

	[PATCH 17/18] writeback: make writeback_control.nr_to_write straight
	[PATCH 18/18] writeback: rearrange the wb_writeback() loop

Thanks to Jan and Dave for the careful reviews!

The patches are git pullable from

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback

 fs/block_dev.c                   |   16 +-
 fs/btrfs/extent_io.c             |    2 -
 fs/ext4/inode.c                  |    4 +-
 fs/fs-writeback.c                |  411 +++++++++++++++++++++-----------------
 fs/inode.c                       |    5 +-
 fs/nfs/write.c                   |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c      |    2 +-
 include/linux/backing-dev.h      |    2 +
 include/linux/writeback.h        |   14 +-
 include/trace/events/btrfs.h     |    6 +-
 include/trace/events/ext4.h      |    6 +-
 include/trace/events/writeback.h |  137 ++++++++++++--
 mm/backing-dev.c                 |   30 ++-
 mm/filemap.c                     |    6 +-
 mm/page-writeback.c              |   42 ++--
 mm/rmap.c                        |    4 +-
 16 files changed, 420 insertions(+), 270 deletions(-)

Thanks,
Fengguang



* [PATCH 01/18] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
	linux-fsdevel, LKML


sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.

Note that writeback_inodes_sb() is not only called by sync(); its other
callers are treated the same because they also need livelock prevention.

Impact: it changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until it has finished with the current one.
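
For readers following along, here is a minimal userspace model of the
tag-and-write pattern (plain illustrative C; the names mirror, but are
not, the kernel APIs): pages dirtied after the tagging pass carry no
tag, so the slow write stage visits a bounded set and cannot be
livelocked by a concurrent dirtier.

	/* Minimal userspace model of tag-and-write livelock avoidance.
	 * Illustrative only, not kernel code. Build: cc -o tagsync tagsync.c
	 */
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_PAGES 8

	static bool dirty[NR_PAGES];
	static bool towrite[NR_PAGES];	/* models PAGECACHE_TAG_TOWRITE */

	/* stage 1: (quickly) tag currently dirty pages */
	static void tag_pages(void)
	{
		for (int i = 0; i < NR_PAGES; i++)
			if (dirty[i])
				towrite[i] = true;
	}

	/* stage 2: (maybe slowly) sync only the tagged pages */
	static void write_tagged(void)
	{
		for (int i = 0; i < NR_PAGES; i++) {
			if (!towrite[i])
				continue;
			towrite[i] = false;
			dirty[i] = false;
			printf("wrote page %d\n", i);
			dirty[0] = true;	/* concurrent re-dirtying: the page
						 * stays dirty but untagged, so the
						 * walk still terminates */
		}
	}

	int main(void)
	{
		for (int i = 0; i < NR_PAGES; i += 2)
			dirty[i] = true;
		tag_pages();
		write_tagged();
		return 0;
	}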

Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/ext4/inode.c           |    4 ++--
 fs/fs-writeback.c         |   17 +++++++++--------
 include/linux/writeback.h |    1 +
 mm/page-writeback.c       |    4 ++--
 4 files changed, 14 insertions(+), 12 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:13.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:14.000000000 +0800
@@ -36,6 +36,7 @@ struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	enum writeback_sync_modes sync_mode;
+	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
@@ -650,6 +651,7 @@ static long wb_writeback(struct bdi_writ
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
 		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
@@ -657,7 +659,7 @@ static long wb_writeback(struct bdi_writ
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
-	long write_chunk;
+	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 
 	if (wbc.for_kupdate) {
@@ -683,9 +685,7 @@ static long wb_writeback(struct bdi_writ
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
 	 */
-	if (wbc.sync_mode == WB_SYNC_NONE)
-		write_chunk = MAX_WRITEBACK_PAGES;
-	else
+	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
@@ -1191,10 +1191,11 @@ void writeback_inodes_sb_nr(struct super
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
-		.sb		= sb,
-		.sync_mode	= WB_SYNC_NONE,
-		.done		= &done,
-		.nr_pages	= nr,
+		.sb			= sb,
+		.sync_mode		= WB_SYNC_NONE,
+		.tagged_writepages	= 1,
+		.done			= &done,
+		.nr_pages		= nr,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:13.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:14.000000000 +0800
@@ -47,6 +47,7 @@ struct writeback_control {
 	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
+	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned more_io:1;		/* more io to be dispatched */
--- linux-next.orig/mm/page-writeback.c	2011-05-24 11:17:13.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-24 11:17:14.000000000 +0800
@@ -892,12 +892,12 @@ int write_cache_pages(struct address_spa
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
--- linux-next.orig/fs/ext4/inode.c	2011-05-24 11:17:13.000000000 +0800
+++ linux-next/fs/ext4/inode.c	2011-05-24 11:17:14.000000000 +0800
@@ -2741,7 +2741,7 @@ static int write_cache_pages_da(struct a
 	index = wbc->range_start >> PAGE_CACHE_SHIFT;
 	end = wbc->range_end >> PAGE_CACHE_SHIFT;
 
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
@@ -2975,7 +2975,7 @@ static int ext4_da_writepages(struct add
 	}
 
 retry:
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, index, end);
 
 	while (!ret && wbc->nr_to_write > 0) {




* [PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML


Explicitly update .dirtied_when on synced inodes, so that they are no
longer considered for writeback in the next round.

We'll do more aggressive "keep writing back as long as we wrote
something" logic in wb_writeback(). The "use LONG_MAX .nr_to_write"
trick in commit b9543dac5bbc ("writeback: avoid livelocking WB_SYNC_ALL
writeback") will no longer be enough to stop sync livelock.

This patch prevents both of the following livelock scenarios:

- while true; do echo data >> f; done
- while true; do touch f;        done
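
To see why refreshing the timestamp is enough, here is a standalone
model (illustrative C, not the kernel code) of the enqueue rule: only
inodes dirtied before the sync pass started qualify, so bumping
dirtied_when on a just-synced-but-redirtied inode keeps it out of the
current pass.

	#include <stdbool.h>
	#include <stdio.h>

	struct inode_model {
		unsigned long dirtied_when;
		bool dirty;
	};

	/* only inodes dirtied before this sync pass started qualify */
	static bool eligible(const struct inode_model *i, unsigned long start)
	{
		return i->dirty && i->dirtied_when < start;
	}

	int main(void)
	{
		struct inode_model f = { .dirtied_when = 0, .dirty = true };
		unsigned long sync_start = 10, jiffies = 10;

		while (eligible(&f, sync_start)) {
			jiffies++;			/* time passes while writing */
			f.dirty = true;			/* an appender redirties f */
			f.dirtied_when = jiffies;	/* the fix; drop this line
							 * and the loop never ends */
		}
		printf("sync finished at jiffies=%lu\n", jiffies);
		return 0;
	}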

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    9 +++++++++
 1 file changed, 9 insertions(+)

ext3/ext4 are working fine now; however, tests show that XFS may still
livelock inside the XFS routines:

[ 3581.181253] sync            D ffff8800b6ca15d8  4560  4403   4392 0x00000000
[ 3581.181734]  ffff88006f775bc8 0000000000000046 ffff8800b6ca12b8 00000001b6ca1938
[ 3581.182411]  ffff88006f774000 00000000001d2e40 00000000001d2e40 ffff8800b6ca1280
[ 3581.183088]  00000000001d2e40 ffff88006f775fd8 00000340af111ef2 00000000001d2e40
[ 3581.183765] Call Trace:
[ 3581.184008]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.184392]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.184756]  [<ffffffff8108cc0d>] ? prepare_to_wait+0x6c/0x79
[ 3581.185120]  [<ffffffff812ed520>] xfs_ioend_wait+0x87/0x9f
[ 3581.185474]  [<ffffffff8108c97a>] ? wake_up_bit+0x2a/0x2a
[ 3581.185827]  [<ffffffff812f742a>] xfs_sync_inode_data+0x92/0x9d
[ 3581.186198]  [<ffffffff812f76e2>] xfs_inode_ag_walk+0x1a5/0x287
[ 3581.186569]  [<ffffffff812f779b>] ? xfs_inode_ag_walk+0x25e/0x287
[ 3581.186946]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.187311]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.187669]  [<ffffffff81092175>] ? local_clock+0x41/0x5a
[ 3581.188020]  [<ffffffff8109be73>] ? lock_release_holdtime+0xa3/0xab
[ 3581.188403]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.188773]  [<ffffffff812e2354>] ? xfs_perag_get+0x68/0xd0
[ 3581.189130]  [<ffffffff812e236c>] ? xfs_perag_get+0x80/0xd0
[ 3581.189488]  [<ffffffff812e22ec>] ? xfs_check_sizes+0x160/0x160
[ 3581.189858]  [<ffffffff812f7831>] ? xfs_inode_ag_iterator+0x6d/0x8f
[ 3581.190241]  [<ffffffff812f7398>] ? xfs_sync_worker+0x69/0x69
[ 3581.190606]  [<ffffffff812f780b>] xfs_inode_ag_iterator+0x47/0x8f
[ 3581.190982]  [<ffffffff811611f5>] ? __sync_filesystem+0x7a/0x7a
[ 3581.191352]  [<ffffffff812f7877>] xfs_sync_data+0x24/0x43
[ 3581.191703]  [<ffffffff812f7911>] xfs_quiesce_data+0x2c/0x88
[ 3581.192065]  [<ffffffff812f5556>] xfs_fs_sync_fs+0x21/0x48
[ 3581.192419]  [<ffffffff811611e1>] __sync_filesystem+0x66/0x7a
[ 3581.192783]  [<ffffffff8116120b>] sync_one_sb+0x16/0x18
[ 3581.193128]  [<ffffffff8113e3e3>] iterate_supers+0x72/0xce
[ 3581.193482]  [<ffffffff81161140>] sync_filesystems+0x20/0x22
[ 3581.193842]  [<ffffffff8116127e>] sys_sync+0x21/0x33
[ 3581.194177]  [<ffffffff819016c2>] system_call_fastpath+0x16/0x1b

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:14.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
@@ -419,6 +419,15 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
+		/*
+		 * Sync livelock prevention. Each inode is tagged and synced in
+		 * one shot. If still dirty, it will be redirty_tail()'ed below.
+		 * Update the dirty time to prevent enqueue and sync it again.
+		 */
+		if ((inode->i_state & I_DIRTY) &&
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+			inode->dirtied_when = jiffies;
+
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()




* [PATCH 03/18] writeback: introduce writeback_control.inodes_cleaned
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML


The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happens to be only metadata-dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but is also misinterpreted as "no progress".

So introduce writeback_control.inodes_cleaned to count the inodes that
get cleaned. A non-zero value means some progress was made on writeback,
in which case more writeback can be tried.
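
In standalone C, the per-batch progress test this enables looks roughly
like the sketch below (illustrative, not the kernel function itself):

	#include <stdbool.h>
	#include <stdio.h>

	struct wbc_model {
		long nr_to_write;	/* decremented per page written */
		long inodes_cleaned;	/* incremented per inode cleaned */
	};

	static bool made_progress(const struct wbc_model *wbc, long write_chunk)
	{
		/* pages were written, or metadata-only inodes were cleaned */
		return wbc->nr_to_write < write_chunk || wbc->inodes_cleaned > 0;
	}

	int main(void)
	{
		/* a metadata-only batch: no pages written, 2 inodes cleaned */
		struct wbc_model wbc = { .nr_to_write = 1024, .inodes_cleaned = 2 };

		printf("progress: %d\n", made_progress(&wbc, 1024));	/* 1 */
		return 0;
	}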

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    4 ++++
 include/linux/writeback.h |    1 +
 2 files changed, 5 insertions(+)

about v1: The initial version counted successful ->write_inode() calls.
However, that leads to busy loops for sync() over NFS, because NFS
ridiculously returns 0 (success) while redirtying the inode at the same
time. The NFS case can be trivially fixed; however, there may be more
hidden bugs in other filesystems..

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
@@ -464,6 +464,7 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
+			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -725,6 +726,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
+		wbc.inodes_cleaned = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -741,6 +743,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write <= 0)
 			continue;
+		if (wbc.inodes_cleaned)
+			continue;
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:14.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:16.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is




* [PATCH 04/18] writeback: try more writeback as long as something was written
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML


writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they may populate only a subset of the eligible inodes into b_io at
entrance time. When the queued set of inodes has all been synced, they
just return, possibly with all queued inode pages written but
wbc.nr_to_write still > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes is completed. So
it is necessary to try another round of writeback as long as we made
some progress in this round. When there are no more eligible inodes, no
more inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.

For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

At queue_io() time, i90-i99 happen to be expired and moved to b_io for
IO. When they finish successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over the
background threshold. This will be a fairly normal/frequent case I guess.

Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1).  I actually tried to write 1 page
per 1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
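
Condensed into standalone C, the retry decision this patch installs in
wb_writeback() behaves like the following model (illustrative only; the
real loop lives in fs/fs-writeback.c):

	#include <stdbool.h>
	#include <stdio.h>

	int main(void)
	{
		const long write_chunk = 1024;
		int round = 0;

		do {
			/* fake workload: each round writes half as many pages */
			long written = 512 >> round;
			long nr_to_write = write_chunk - written;
			long inodes_cleaned = 0;
			bool more_io = written > 0;

			printf("round %d: wrote %ld pages\n", round, written);

			if (nr_to_write < write_chunk)
				continue;	/* wrote pages: try for more */
			if (inodes_cleaned)
				continue;	/* cleaned inodes: try for more */
			if (!more_io)
				break;		/* no more inodes for IO, bail */
			/* else: would wait for an inode to finish writeback */
		} while (++round < 100);

		return 0;
	}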

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:16.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:17.000000000 +0800
@@ -739,23 +739,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * Dirty inodes are moved to b_io for writeback in batches.
+		 * The completion of the current batch does not necessarily
+		 * mean the overall work is done. So we keep looping as long
+		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)
 			break;
 		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.




* [PATCH 05/18] writeback: the kupdate expire timestamp should be a moving target
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Mel Gorman, Itaru Kitayama, Wu Fengguang,
	Christoph Hellwig, linux-fsdevel, LKML


Dynamically compute the dirty expire timestamp at queue_io() time.

writeback_control.older_than_this used to be determined once, at
entrance to the kupdate writeback work. This _static_ timestamp may go
stale if the kupdate work runs on and on. The flusher may then get stuck
with some old busy inodes, never considering newly expired inodes
thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  delaying the expired dirty pages to the end of the LRU lists, triggering
  the evil pageout(). Nevertheless, this patch merely addresses part
  of the problem.
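
The static-vs-moving difference can be seen in a standalone model
(illustrative C; EXPIRE_AGE stands in for dirty_expire_interval): with a
threshold computed once at entry, an inode dirtied at t=0 never appears
to expire, while a recomputed threshold eventually catches it.

	#include <stdio.h>

	#define EXPIRE_AGE 30

	static int expired(long dirtied_when, long older_than_this)
	{
		return dirtied_when <= older_than_this;
	}

	int main(void)
	{
		long inode_dirtied = 0;			/* dirtied at t=0 */
		long static_thresh = 0 - EXPIRE_AGE;	/* computed once at entry */

		for (long t = 0; t <= 60; t += 20) {
			long moving_thresh = t - EXPIRE_AGE;

			printf("t=%2ld static:%d moving:%d\n", t,
			       expired(inode_dirtied, static_thresh),
			       expired(inode_dirtied, moving_thresh));
		}
		return 0;	/* "moving" sees the inode expire once t >= 30 */
	}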

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
@@ -672,11 +672,6 @@ static long wb_writeback(struct bdi_writ
 	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -723,6 +718,12 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		if (work->for_kupdate) {
+			oldest_jif = jiffies -
+				msecs_to_jiffies(dirty_expire_interval * 10);
+			wbc.older_than_this = &oldest_jif;
+		}
+
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;




* [PATCH 06/18] writeback: sync expired inodes first in background writeback
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Rik van Riel, Mel Gorman, Wu Fengguang,
	Christoph Hellwig, linux-fsdevel, LKML


A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

At each queue_io() time, first try enqueuing only newly expired inodes.
If there are zero expired inodes to work with, then relax the rule and
enqueue all dirty inodes (see the sketch further below).

It at least makes sense from the data integrity point of view.

This may also reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So syncing
older inodes first helps reduce the number of dirty pages reached by
the page reclaim code.

More background: as Mel put it, "it makes sense to write old pages first
to reduce the chances page reclaim is initiating IO."

Rik also presented the situation with a graph:

LRU head                                 [*] dirty page
[                          *              *      * *  *  * * * * * *]

Ideally, most dirty pages should lie close to the LRU tail instead of
the LRU head. That requires the flusher thread to sync old/expired
inodes first (as there are obvious correlations between inode age and
page age), and to give fair opportunities to newly expired inodes rather
than sticking with some large eldest inodes (as larger inodes have
weaker inode<=>page age correlations).

This patch helps the flusher to meet both the above requirements.

Side effects: it might reduce the batch size and hence reduce
inode_wb_list_lock hold time, but in turn makes the cluster-by-partition
logic in the same function less effective at reducing disk seeks.
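
Here is the promised sketch of the two-pass enqueue policy, as a
standalone model (illustrative C; the real fallback is the "goto retry"
path added to wb_writeback() below):

	#include <stdio.h>

	#define NR_INODES 4

	static long dirtied_when[NR_INODES] = { 80, 85, 90, 95 };
	static long jiffies = 100, dirty_expire = 30;

	static int queue_io_model(int expired_only)
	{
		int queued = 0;

		for (int i = 0; i < NR_INODES; i++)
			if (!expired_only ||
			    dirtied_when[i] <= jiffies - dirty_expire)
				queued++;
		return queued;
	}

	int main(void)
	{
		int n = queue_io_model(1);	/* pass 1: expired inodes only */

		if (n == 0)
			n = queue_io_model(0);	/* pass 2: relax, take all dirty */
		printf("queued %d inodes\n", n);	/* 4, via the fallback */
		return 0;
	}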

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
@@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
-		if (work->for_kupdate) {
+		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
 			wbc.older_than_this = &oldest_jif;
@@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.pages_skipped = 0;
 		wbc.inodes_cleaned = 0;
 
+retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
 			__writeback_inodes_sb(work->sb, wb, &wbc);
@@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
+		 * background writeback will start with expired inodes, and
+		 * if none is found, fallback to all inodes. This order helps
+		 * reduce the number of dirty pages reaching the end of LRU
+		 * lists and cause trouble to the page reclaim.
+		 */
+		if (work->for_background &&
+		    wbc.older_than_this &&
+		    list_empty(&wb->b_io) &&
+		    list_empty(&wb->b_more_io)) {
+			wbc.older_than_this = NULL;
+			goto retry;
+		}
+		/*
 		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)




* [PATCH 07/18] writeback: refill b_io iff empty
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML


There is no point in carrying different refill policies between
for_kupdate and other types of work. Use a consistent "refill b_io iff
empty" policy, which can guarantee fairness in an easy-to-understand way.

A b_io refill will set up a _fixed_ work set with all currently eligible
inodes and start a new round of walking through b_io. The "fixed" work
set means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done will the inodes that are
eligible at that time be enqueued and the walk started over.

This procedure provides fairness among the inodes because it guarantees
that each inode is synced once and only once in each round. So no inode
can be starved.

This change relies on wb_writeback() to keep retrying as long as we made
some progress on cleaning pages and/or inodes. Without that ability, the
old logic for background works relied on aggressively queuing all
eligible inodes into b_io every time. But that's not a guarantee.
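
The fairness property can be seen in a standalone model (illustrative C,
not the kernel lists): inodes that become eligible mid-walk wait for the
next refill, so every inode in a work set is visited exactly once per
round.

	#include <stdio.h>

	int main(void)
	{
		int b_dirty = 3, b_io = 0, round = 0, synced = 0;

		while (synced < 9) {
			if (b_io == 0) {	/* refill b_io iff empty */
				b_io = b_dirty;
				b_dirty = 0;
				printf("round %d: work set of %d inodes\n",
				       ++round, b_io);
			}
			b_io--;			/* sync one inode from the set */
			synced++;
			b_dirty++;		/* a newly dirtied inode joins the
						 * NEXT round, not this walk */
		}
		return 0;
	}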

The test script below completes slightly faster now:

             2.6.39-rc3	  2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043      252.367
stddev           24.381       12.530

tar elapsed      30.097       28.808
dd  elapsed      13.214       11.782

	#!/bin/zsh

	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

	umount /dev/sda7
	mkfs.xfs -f /dev/sda7
	mount /dev/sda7 /fs

	echo 3 > /proc/sys/vm/drop_caches

	tic=$(cat /proc/uptime|cut -d' ' -f2)

	cd /fs
	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

	wait
	sync
	tac=$(cat /proc/uptime|cut -d' ' -f2)
	echo elapsed: $((tac - tic))

It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.

Analysis from Dave Chinner, in great detail:

Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)

	writeback  8 small inodes 800 pages
		   1 large inode 224 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)
	.....

Your new code:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 8s)
	writeback  8 small inodes 800 pages

	b_io empty: (1800 pages written)
		b_more_io (large inode) -> b_io (1l)
		14 newly expired inodes -> b_io (1l, 14s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 14s)
	writeback  10 small inodes 1000 pages
		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
	writeback  5 small inodes 500 pages
	b_io empty: (2548 pages written)
		b_more_io (large inode) -> b_io (1l, 1s(24))
		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
	......

Rough progression of pages written at b_io refill:

Old code:

	total	large file	% of writeback
	1024	224		21.9% (fixed)

New code:
	total	large file	% of writeback
	1800	1024		~55%
	2550	1024		~40%
	3050	1024		~33%
	3500	1024		~29%
	3950	1024		~26%
	4250	1024		~24%
	4500	1024		~22.7%
	4700	1024		~21.7%
	4800	1024		~21.3%
	4800	1024		~21.3%
	(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

after + dyn-expire + ioless:
                &(&wb->list_lock)->rlock:          2291           2304           0.15         204.09        3125.12          35315         970712           0.10         223.84     1113437.05
                ------------------------
                &(&wb->list_lock)->rlock              9          [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock           1614          [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock            459          [<ffffffff8115d351>] writeback_sb_inodes+0x108/0x154
                &(&wb->list_lock)->rlock            137          [<ffffffff8115cdcf>] writeback_single_inode+0x1b4/0x296
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110c367>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock           1160          [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock            435          [<ffffffff8115dcb6>] writeback_inodes_wb+0x33/0x12b

after + dyn-expire:
                &(&wb->list_lock)->rlock:        226820         229719           0.10         194.28      809275.91         327372     1033513685           0.08         476.96  3590929811.61
                ------------------------
                &(&wb->list_lock)->rlock             11          [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock          30559          [<ffffffff8115bb1f>] wb_writeback+0x2fb/0x3c3
                &(&wb->list_lock)->rlock          37339          [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
                &(&wb->list_lock)->rlock          54880          [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b606>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
                &(&wb->list_lock)->rlock          55347          [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
                &(&wb->list_lock)->rlock          55338          [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:19.000000000 +0800
@@ -589,7 +589,8 @@ void writeback_inodes_wb(struct bdi_writ
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
@@ -616,7 +617,7 @@ static void __writeback_inodes_sb(struct
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_wb_list_lock);




* [PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
From: Wu Fengguang @ 2011-05-24  5:14 UTC
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Hugh Dickins, Wu Fengguang,
	Dave Chinner, Christoph Hellwig, linux-fsdevel, LKML


From: Christoph Hellwig <hch@infradead.org>

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for
metadata-heavy workloads.  It won't help for single-filesystem
workloads, for which we'll need the I/O-less balance_dirty_pages(), but
at least we can dedicate a CPU to spinning on each bdi now for larger
systems.

Based on earlier patches from Nick Piggin and Dave Chinner.
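
One subtlety: with per-bdi locks, moving an inode between two bdis
requires holding both locks, so the patch takes them in address order
(bdi_lock_two() in the diff below). The classic idiom, as a standalone
pthreads model (illustrative, not the kernel spinlock API):

	#include <pthread.h>
	#include <stdio.h>

	struct wb_model {
		pthread_mutex_t list_lock;
	};

	/* always acquire in a globally consistent (address) order, so two
	 * racing callers cannot deadlock ABBA-style; callers must handle
	 * a == b separately (cf. the dst == old check the patch adds to
	 * bdev_inode_switch_bdi())
	 */
	static void lock_two(struct wb_model *a, struct wb_model *b)
	{
		if (a < b) {
			pthread_mutex_lock(&a->list_lock);
			pthread_mutex_lock(&b->list_lock);
		} else {
			pthread_mutex_lock(&b->list_lock);
			pthread_mutex_lock(&a->list_lock);
		}
	}

	int main(void)
	{
		struct wb_model w1 = { PTHREAD_MUTEX_INITIALIZER };
		struct wb_model w2 = { PTHREAD_MUTEX_INITIALIZER };

		lock_two(&w1, &w2);	/* same order as lock_two(&w2, &w1) */
		puts("both locks held");
		pthread_mutex_unlock(&w2.list_lock);
		pthread_mutex_unlock(&w1.list_lock);
		return 0;
	}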

It reduces lock contention to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive locking when bdi_lock_two() is called with new equal to old
akpm@linux-foundation.org: clean up the bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/block_dev.c              |   16 +++--
 fs/fs-writeback.c           |   97 +++++++++++++++++-----------------
 fs/inode.c                  |    5 -
 include/linux/backing-dev.h |    2 
 include/linux/writeback.h   |    2 
 mm/backing-dev.c            |   21 +++++--
 mm/filemap.c                |    6 +-
 mm/rmap.c                   |    4 -
 8 files changed, 85 insertions(+), 68 deletions(-)

--- linux-next.orig/fs/block_dev.c	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/fs/block_dev.c	2011-05-24 11:17:20.000000000 +0800
@@ -44,24 +44,28 @@ inline struct block_device *I_BDEV(struc
 {
 	return &BDEV_I(inode)->bdev;
 }
-
 EXPORT_SYMBOL(I_BDEV);
 
 /*
- * move the inode from it's current bdi to the a new bdi. if the inode is dirty
- * we need to move it onto the dirty list of @dst so that the inode is always
- * on the right list.
+ * Move the inode from its current bdi to a new bdi. If the inode is dirty we
+ * need to move it onto the dirty list of @dst so that the inode is always on
+ * the right list.
  */
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
+	if (unlikely(dst == old))		/* deadlock avoidance */
+		return;
+	bdi_lock_two(&old->wb, &dst->wb);
 	spin_lock(&inode->i_lock);
 	inode->i_data.backing_dev_info = dst;
 	if (inode->i_state & I_DIRTY)
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&old->wb.list_lock);
+	spin_unlock(&dst->wb.list_lock);
 }
 
 static sector_t max_block(struct block_device *bdev)
--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:19.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:20.000000000 +0800
@@ -181,12 +181,13 @@ void bdi_start_background_writeback(stru
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
 	list_del_init(&inode->i_wb_list);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 }
 
-
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -196,11 +197,9 @@ void inode_wb_list_del(struct inode *ino
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -214,11 +213,9 @@ static void redirty_tail(struct inode *i
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
-static void requeue_io(struct inode *inode)
+static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
@@ -226,7 +223,7 @@ static void inode_sync_complete(struct i
 {
 	/*
 	 * Prevent speculative execution through
-	 * spin_unlock(&inode_wb_list_lock);
+	 * spin_unlock(&wb->list_lock);
 	 */
 
 	smp_mb();
@@ -302,7 +299,7 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -317,7 +314,8 @@ static int write_inode(struct inode *ino
 /*
  * Wait for writeback on an inode to complete.
  */
-static void inode_wait_for_writeback(struct inode *inode)
+static void inode_wait_for_writeback(struct inode *inode,
+				     struct bdi_writeback *wb)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -325,15 +323,15 @@ static void inode_wait_for_writeback(str
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_wb_list_lock and
+ * Write out an inode's dirty pages.  Called under wb->list_lock and
  * inode->i_lock.  Either the caller has an active reference on the inode or
  * the inode has I_WILL_FREE set.
  *
@@ -344,13 +342,14 @@ static void inode_wait_for_writeback(str
  * livelocks, etc.
  */
 static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+		       struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
 
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
 
 	if (!atomic_read(&inode->i_count))
@@ -368,14 +367,14 @@ writeback_single_inode(struct inode *ino
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			return 0;
 		}
 
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
-		inode_wait_for_writeback(inode);
+		inode_wait_for_writeback(inode, wb);
 	}
 
 	BUG_ON(inode->i_state & I_SYNC);
@@ -384,7 +383,7 @@ writeback_single_inode(struct inode *ino
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -415,7 +414,7 @@ writeback_single_inode(struct inode *ino
 			ret = err;
 	}
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -438,7 +437,7 @@ writeback_single_inode(struct inode *ino
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(inode);
+				requeue_io(inode, wb);
 			} else {
 				/*
 				 * Writeback blocked by something other than
@@ -447,7 +446,7 @@ writeback_single_inode(struct inode *ino
 				 * retrying writeback of the dirty page/inode
 				 * that cannot be performed immediately.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 			}
 		} else if (inode->i_state & I_DIRTY) {
 			/*
@@ -456,7 +455,7 @@ writeback_single_inode(struct inode *ino
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -521,7 +520,7 @@ static int writeback_sb_inodes(struct su
 				 * superblock, move all inodes not belonging
 				 * to it back onto the dirty list.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 				continue;
 			}
 
@@ -541,7 +540,7 @@ static int writeback_sb_inodes(struct su
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 
@@ -557,19 +556,19 @@ static int writeback_sb_inodes(struct su
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wbc);
+		writeback_single_inode(inode, wb, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -588,7 +587,7 @@ void writeback_inodes_wb(struct bdi_writ
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
@@ -598,7 +597,7 @@ void writeback_inodes_wb(struct bdi_writ
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 		ret = writeback_sb_inodes(sb, wb, wbc, false);
@@ -607,7 +606,7 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -616,11 +615,11 @@ static void __writeback_inodes_sb(struct
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 }
 
 /*
@@ -776,15 +775,15 @@ retry:
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode);
+			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 	}
 
 	return wrote;
@@ -1121,10 +1120,10 @@ void __mark_inode_dirty(struct inode *in
 			}
 
 			spin_unlock(&inode->i_lock);
-			spin_lock(&inode_wb_list_lock);
+			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
-			spin_unlock(&inode_wb_list_lock);
+			spin_unlock(&bdi->wb.list_lock);
 
 			if (wakeup_bdi)
 				bdi_wakeup_thread_delayed(bdi);
@@ -1326,6 +1325,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
@@ -1338,11 +1338,11 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, &wbc);
+	ret = writeback_single_inode(inode, wb, &wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -1362,13 +1362,14 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wbc);
+	ret = writeback_single_inode(inode, wb, wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
--- linux-next.orig/fs/inode.c	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/fs/inode.c	2011-05-24 11:17:20.000000000 +0800
@@ -37,7 +37,7 @@
  *   inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
- * inode_wb_list_lock protects:
+ * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
@@ -48,7 +48,7 @@
  *   inode->i_lock
  *     inode_lru_lock
  *
- * inode_wb_list_lock
+ * bdi->wb.list_lock
  *   inode->i_lock
  *
  * inode_hash_lock
@@ -111,7 +111,6 @@ static LIST_HEAD(inode_lru);
 static DEFINE_SPINLOCK(inode_lru_lock);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);
 
 /*
  * iprune_sem provides exclusion between the icache shrinking and the
--- linux-next.orig/include/linux/backing-dev.h	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-05-24 11:17:20.000000000 +0800
@@ -57,6 +57,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
 struct backing_dev_info {
@@ -106,6 +107,7 @@ int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:16.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:20.000000000 +0800
@@ -9,8 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_wb_list_lock;
-
 /*
  * fs/fs-writeback.c
  */
--- linux-next.orig/mm/backing-dev.c	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-05-24 11:17:20.000000000 +0800
@@ -45,6 +45,17 @@ static struct timer_list sync_supers_tim
 static int bdi_sync_supers(void *);
 static void sync_supers_timer_fn(unsigned long);
 
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
+{
+	if (wb1 < wb2) {
+		spin_lock(&wb1->list_lock);
+		spin_lock_nested(&wb2->list_lock, 1);
+	} else {
+		spin_lock(&wb2->list_lock);
+		spin_lock_nested(&wb1->list_lock, 1);
+	}
+}
+
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
@@ -67,14 +78,14 @@ static int bdi_debug_stats_show(struct s
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -628,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -676,11 +688,12 @@ void bdi_destroy(struct backing_dev_info
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&inode_wb_list_lock);
+		bdi_lock_two(&bdi->wb, dst);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
+		spin_unlock(&dst->list_lock);
 	}
 
 	bdi_unregister(bdi);
--- linux-next.orig/mm/filemap.c	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/mm/filemap.c	2011-05-24 11:17:20.000000000 +0800
@@ -81,7 +81,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  inode_wb_list_lock
+ *  bdi->wb.list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -99,9 +99,9 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(page_remove_rmap->set_page_dirty)
+ *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(zap_pte_range->set_page_dirty)
+ *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
--- linux-next.orig/mm/rmap.c	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/mm/rmap.c	2011-05-24 11:17:20.000000000 +0800
@@ -32,11 +32,11 @@
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
  *               inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *               inode_wb_list_lock (in set_page_dirty's __mark_inode_dirty)
+ *               bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                 sb_lock (within inode_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_wb_list_lock in __sync_single_inode)
+ *                           within bdi.wb->list_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 09/18] writeback: elevate queue_io() into wb_writeback()
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (7 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 10/18] writeback: avoid extra sync work at enqueue time Wu Fengguang
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: move-queue_io.patch --]
[-- Type: text/plain, Size: 3402 bytes --]

Code refactoring for a more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
  before calling __writeback_inodes_wb() (see the sketch below)
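
For reference, a minimal sketch of the resulting layering, condensed from
the diff below (the sb-pinned path used by wb_writeback() is analogous):

        /* only the outermost functions take wb->list_lock and refill
         * b_io; __writeback_inodes_wb() purely consumes b_io */
        void writeback_inodes_wb(struct bdi_writeback *wb,
                        struct writeback_control *wbc)
        {
                spin_lock(&wb->list_lock);
                if (list_empty(&wb->b_io))
                        queue_io(wb, wbc->older_than_this);
                __writeback_inodes_wb(wb, wbc);
                spin_unlock(&wb->list_lock);
        }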

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:20.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:21.000000000 +0800
@@ -580,17 +580,13 @@ static int writeback_sb_inodes(struct su
 	return 1;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+static void __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct writeback_control *wbc)
 {
 	int ret = 0;
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&wb->list_lock);
-
-	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -606,19 +602,16 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
-static void __writeback_inodes_sb(struct super_block *sb,
-		struct bdi_writeback *wb, struct writeback_control *wbc)
+void writeback_inodes_wb(struct bdi_writeback *wb,
+		struct writeback_control *wbc)
 {
-	WARN_ON(!rwsem_is_locked(&sb->s_umount));
-
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
-	writeback_sb_inodes(sb, wb, wbc, true);
+	__writeback_inodes_wb(wb, wbc);
 	spin_unlock(&wb->list_lock);
 }
 
@@ -685,7 +678,7 @@ static long wb_writeback(struct bdi_writ
 	 * The intended call sequence for WB_SYNC_ALL writeback is:
 	 *
 	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
+	 *          writeback_sb_inodes()       <== called only once
 	 *              write_cache_pages()     <== called once for each inode
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
@@ -694,6 +687,7 @@ static long wb_writeback(struct bdi_writ
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
+	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -731,10 +725,12 @@ static long wb_writeback(struct bdi_writ
 
 retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
+		if (list_empty(&wb->b_io))
+			queue_io(wb, wbc.older_than_this);
 		if (work->sb)
-			__writeback_inodes_sb(work->sb, wb, &wbc);
+			writeback_sb_inodes(work->sb, wb, &wbc, true);
 		else
-			writeback_inodes_wb(wb, &wbc);
+			__writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
@@ -775,7 +771,6 @@ retry:
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
@@ -783,8 +778,8 @@ retry:
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&wb->list_lock);
 	}
+	spin_unlock(&wb->list_lock);
 
 	return wrote;
 }



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 10/18] writeback: avoid extra sync work at enqueue time
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (8 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 09/18] writeback: elevate queue_io() into wb_writeback() Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-wb_start.patch --]
[-- Type: text/plain, Size: 2179 bytes --]

This removes writeback_control.wb_start and implements sync livelock
prevention more straightforwardly, by setting .older_than_this to prevent
extra inodes from being enqueued in the first place (see the sketch below).
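
A condensed sketch of the mechanism, combining the wb_writeback() hunk
below with the pre-existing filter in move_expired_inodes() (which is not
part of this diff): the sync start time becomes an enqueue-time cutoff
instead of a per-inode check at writeback time.

        /* at wb_writeback() entry: freeze the cutoff at the start time */
        oldest_jif = jiffies;
        wbc.older_than_this = &oldest_jif;

        /* in move_expired_inodes(): inodes dirtied after the cutoff are
         * never moved to b_io, so sync cannot livelock on them */
        while (!list_empty(delaying_queue)) {
                inode = wb_inode(delaying_queue->prev);
                if (older_than_this &&
                    inode_dirtied_after(inode, *older_than_this))
                        break;
                ...
        }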

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   16 +++-------------
 include/linux/writeback.h |    3 ---
 2 files changed, 3 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:21.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:22.000000000 +0800
@@ -544,15 +544,6 @@ static int writeback_sb_inodes(struct su
 			continue;
 		}
 
-		/*
-		 * Was this inode dirtied after sync_sb_inodes was called?
-		 * This keeps sync from extra jobs and livelock.
-		 */
-		if (inode_dirtied_after(inode, wbc->wb_start)) {
-			spin_unlock(&inode->i_lock);
-			return 1;
-		}
-
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
@@ -585,9 +576,6 @@ static void __writeback_inodes_wb(struct
 {
 	int ret = 0;
 
-	if (!wbc->wb_start)
-		wbc->wb_start = jiffies; /* livelock avoidance */
-
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
 		struct super_block *sb = inode->i_sb;
@@ -686,7 +674,9 @@ static long wb_writeback(struct bdi_writ
 	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
 		write_chunk = LONG_MAX;
 
-	wbc.wb_start = jiffies; /* livelock avoidance */
+	oldest_jif = jiffies;
+	wbc.older_than_this = &oldest_jif;
+
 	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:20.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:22.000000000 +0800
@@ -26,9 +26,6 @@ struct writeback_control {
 	enum writeback_sync_modes sync_mode;
 	unsigned long *older_than_this;	/* If !NULL, only write back inodes
 					   older than this */
-	unsigned long wb_start;         /* Time writeback_inodes_wb was
-					   called. This is needed to avoid
-					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (9 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 10/18] writeback: avoid extra sync work at enqueue time Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-task_dirty_limit-comment.patch --]
[-- Type: text/plain, Size: 1193 bytes --]

Clarify the bdi_dirty_limit() comment.
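
The allocation idea in one line (an illustration only, not the exact code;
bdi_writeout_fraction() supplies numerator/denominator from each bdi's
recent writeout completions, and the real function additionally applies
the per-bdi min/max ratio clamps):

        /* a bdi's limit scales with its recent share of writeout work */
        bdi_dirty = dirty * numerator / denominator;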

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-24 11:17:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-24 11:17:23.000000000 +0800
@@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-/*
+/**
  * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ *
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * And the "limit" in the name is not seriously taken as hard limit in
+ * balance_dirty_pages().
  *
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
  * - piling up dirty pages (that will take long time to sync) on slow devices
  *



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (10 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 13/18] writeback: remove writeback_control.more_io Wu Fengguang
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Hugh Dickins, Rik van Riel, Peter Zijlstra,
	Wu Fengguang, Dave Chinner, Christoph Hellwig, linux-fsdevel,
	LKML

[-- Attachment #1: writeback-trace-global-dirty-states-fix.patch --]
[-- Type: text/plain, Size: 2707 bytes --]

This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

Notes about the tmpfs/ramfs behavior changes:

As for 2.6.36 and older kernels, tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold.  This is because both the bdi dirty pages and
the bdi threshold will be 0 for tmpfs/ramfs, hence this test will always
evaluate to TRUE:

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                        || (nr_reclaimable + nr_writeback >= dirty_thresh);

For 2.6.37, someone complained that the then-current logic did not allow
users to set vm.dirty_ratio=0.  So commit 4cbec4c8b9 changed the test to

                dirty_exceeded =
                        (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                        || (nr_reclaimable + nr_writeback > dirty_thresh);

So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen; but once it does, it will block many tasks).

I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
that on a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling hardly helps any workload because of its
"either no throttling, or throttled to death" property.

So based on 2.6.37, this patch won't bring more noticeable changes.
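
To make the 2.6.36 arithmetic explicit (an illustration, not part of the
patch): a tmpfs/ramfs bdi neither accounts dirty pages nor earns a share
of the dirty limit, so every bdi-local term in the old test is zero:

        /* the 2.6.36 test, evaluated on tmpfs/ramfs */
        dirty_exceeded =
                (0 + 0 >= 0)                            /* always true */
                || (nr_reclaimable + nr_writeback >= dirty_thresh);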

CC: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-24 11:17:23.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-24 11:17:24.000000000 +0800
@@ -244,13 +244,8 @@ void task_dirty_inc(struct task_struct *
 static void bdi_writeout_fraction(struct backing_dev_info *bdi,
 		long *numerator, long *denominator)
 {
-	if (bdi_cap_writeback_dirty(bdi)) {
-		prop_fraction_percpu(&vm_completions, &bdi->completions,
+	prop_fraction_percpu(&vm_completions, &bdi->completions,
 				numerator, denominator);
-	} else {
-		*numerator = 0;
-		*denominator = 1;
-	}
 }
 
 static inline void task_dirties_fraction(struct task_struct *tsk,
@@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (!bdi_cap_account_dirty(bdi))
+		return;
+
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 13/18] writeback: remove writeback_control.more_io
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (11 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 14/18] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-more_io.patch --]
[-- Type: text/plain, Size: 4431 bytes --]

When wbc.more_io was first introduced, it indicated whether at least one
superblock's s_more_io list contained more IO work. Now with per-bdi
writeback, it can be replaced with a simple b_more_io emptiness test.
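
The replacement boils down to this check in wb_writeback(), taken from
the diff below:

        /*
         * No more inodes for IO, bail
         */
        if (list_empty(&wb->b_more_io))
                break;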

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    9 ++-------
 include/linux/writeback.h        |    1 -
 include/trace/events/ext4.h      |    6 ++----
 include/trace/events/writeback.h |    5 +----
 4 files changed, 5 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:25.000000000 +0800
@@ -560,12 +560,8 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			return 1;
-		}
-		if (!list_empty(&wb->b_more_io))
-			wbc->more_io = 1;
 	}
 	/* b_io is empty */
 	return 1;
@@ -708,7 +704,6 @@ static long wb_writeback(struct bdi_writ
 			wbc.older_than_this = &oldest_jif;
 		}
 
-		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 		wbc.inodes_cleaned = 0;
@@ -754,7 +749,7 @@ retry:
 		/*
 		 * No more inodes for IO, bail
 		 */
-		if (!wbc.more_io)
+		if (list_empty(&wb->b_more_io))
 			break;
 		/*
 		 * Nothing written. Wait for some inode to
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:25.000000000 +0800
@@ -46,7 +46,6 @@ struct writeback_control {
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
-	unsigned more_io:1;		/* more io to be dispatched */
 };
 
 /*
--- linux-next.orig/include/trace/events/writeback.h	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-24 11:17:25.000000000 +0800
@@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(int, more_io)
 		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
@@ -116,7 +115,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->more_io	= wbc->more_io;
 		__entry->older_than_this = wbc->older_than_this ?
 						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
@@ -124,7 +122,7 @@ DECLARE_EVENT_CLASS(wbc_class,
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -134,7 +132,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->more_io,
 		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
--- linux-next.orig/include/trace/events/ext4.h	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/include/trace/events/ext4.h	2011-05-24 11:17:25.000000000 +0800
@@ -404,7 +404,6 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__field(	int,	pages_written		)
 		__field(	long,	pages_skipped		)
 		__field(	int,	sync_mode		)
-		__field(	char,	more_io			)	
 		__field(       pgoff_t,	writeback_index		)
 	),
 
@@ -415,16 +414,15 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__entry->pages_written	= pages_written;
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->sync_mode	= wbc->sync_mode;
-		__entry->more_io	= wbc->more_io;
 		__entry->writeback_index = inode->i_mapping->writeback_index;
 	),
 
 	TP_printk("dev %d,%d ino %lu ret %d pages_written %d pages_skipped %ld "
-		  " more_io %d sync_mode %d writeback_index %lu",
+		  "sync_mode %d writeback_index %lu",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long) __entry->ino, __entry->ret,
 		  __entry->pages_written, __entry->pages_skipped,
-		  __entry->more_io, __entry->sync_mode,
+		  __entry->sync_mode,
 		  (unsigned long) __entry->writeback_index)
 );
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 14/18] writeback: remove .nonblocking and .encountered_congestion
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (12 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 13/18] writeback: remove writeback_control.more_io Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 15/18] writeback: trace event writeback_single_inode Wu Fengguang
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-kill-nonblocking.patch --]
[-- Type: text/plain, Size: 3584 bytes --]

Remove two unused struct writeback_control fields:

	.encountered_congestion	(completely unused)
	.nonblocking		(never set; checked/shown in XFS, NFS and btrfs)

The .for_background check in nfs_write_inode() is also removed along the
way, since .for_background implies WB_SYNC_NONE.
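
The nfs_write_inode() test then condenses to (equivalent to the hunk
below, which keeps the original two-statement form):

        /* WB_SYNC_NONE alone now selects an async layoutcommit */
        bool sync = wbc->sync_mode != WB_SYNC_NONE;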

Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c               |    3 +--
 fs/xfs/linux-2.6/xfs_aops.c  |    2 +-
 include/linux/writeback.h    |    2 --
 include/trace/events/btrfs.h |    6 ++----
 4 files changed, 4 insertions(+), 9 deletions(-)

--- linux-next.orig/fs/xfs/linux-2.6/xfs_aops.c	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/fs/xfs/linux-2.6/xfs_aops.c	2011-05-24 11:17:26.000000000 +0800
@@ -970,7 +970,7 @@ xfs_vm_writepage(
 	offset = page_offset(page);
 	type = IO_OVERWRITE;
 
-	if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
+	if (wbc->sync_mode == WB_SYNC_NONE)
 		nonblocking = 1;
 
 	do {
--- linux-next.orig/include/trace/events/btrfs.h	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/include/trace/events/btrfs.h	2011-05-24 11:17:26.000000000 +0800
@@ -284,7 +284,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__field(	long,   pages_skipped		)
 		__field(	loff_t, range_start		)
 		__field(	loff_t, range_end		)
-		__field(	char,   nonblocking		)
 		__field(	char,   for_kupdate		)
 		__field(	char,   for_reclaim		)
 		__field(	char,   range_cyclic		)
@@ -299,7 +298,6 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->range_start	= wbc->range_start;
 		__entry->range_end	= wbc->range_end;
-		__entry->nonblocking	= wbc->nonblocking;
 		__entry->for_kupdate	= wbc->for_kupdate;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
@@ -310,13 +308,13 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
 
 	TP_printk("root = %llu(%s), ino = %lu, page_index = %lu, "
 		  "nr_to_write = %ld, pages_skipped = %ld, range_start = %llu, "
-		  "range_end = %llu, nonblocking = %d, for_kupdate = %d, "
+		  "range_end = %llu, for_kupdate = %d, "
 		  "for_reclaim = %d, range_cyclic = %d, writeback_index = %lu",
 		  show_root_type(__entry->root_objectid),
 		  (unsigned long)__entry->ino, __entry->index,
 		  __entry->nr_to_write, __entry->pages_skipped,
 		  __entry->range_start, __entry->range_end,
-		  __entry->nonblocking, __entry->for_kupdate,
+		  __entry->for_kupdate,
 		  __entry->for_reclaim, __entry->range_cyclic,
 		  (unsigned long)__entry->writeback_index)
 );
--- linux-next.orig/fs/nfs/write.c	2011-05-24 11:17:12.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-05-24 11:17:26.000000000 +0800
@@ -1562,8 +1562,7 @@ int nfs_write_inode(struct inode *inode,
 		int status;
 		bool sync = true;
 
-		if (wbc->sync_mode == WB_SYNC_NONE || wbc->nonblocking ||
-		    wbc->for_background)
+		if (wbc->sync_mode == WB_SYNC_NONE)
 			sync = false;
 
 		status = pnfs_layoutcommit_inode(inode, sync);
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:25.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:26.000000000 +0800
@@ -39,8 +39,6 @@ struct writeback_control {
 	loff_t range_start;
 	loff_t range_end;
 
-	unsigned nonblocking:1;		/* Don't get stuck on request queues */
-	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 15/18] writeback: trace event writeback_single_inode
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (13 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 14/18] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 16/18] writeback: trace event writeback_queue_io Wu Fengguang
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/plain, Size: 4034 bytes --]

It is valuable to know how the dirty inodes are iterated over and what their IO sizes are.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

v2: add trace event writeback_single_inode_requeue as proposed by Dave.
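
How the numbers in the sample line are derived (condensed from the
TP_fast_assign() in the diff below):

        /* age is reported in milliseconds since the inode was dirtied */
        age   = (jiffies - inode->dirtied_when) * 1000 / HZ;
        /* wrote = nr_to_write at entry minus what is left afterwards */
        wrote = nr_to_write - wbc->nr_to_write;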

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    4 +
 include/trace/events/writeback.h |   70 +++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+)

--- linux-next.orig/include/trace/events/writeback.h	2011-05-24 11:17:25.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-24 11:17:26.000000000 +0800
@@ -8,6 +8,19 @@
 #include <linux/device.h>
 #include <linux/writeback.h>
 
+#define show_inode_state(state)					\
+	__print_flags(state, "|",				\
+		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
+		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
+		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"},	\
+		{I_NEW,			"I_NEW"},		\
+		{I_WILL_FREE,		"I_WILL_FREE"},		\
+		{I_FREEING,		"I_FREEING"},		\
+		{I_CLEAR,		"I_CLEAR"},		\
+		{I_SYNC,		"I_SYNC"},		\
+		{I_REFERENCED,		"I_REFERENCED"}		\
+	)
+
 struct wb_writeback_work;
 
 DECLARE_EVENT_CLASS(writeback_work_class,
@@ -184,6 +197,63 @@ DEFINE_EVENT(writeback_congest_waited_te
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
+DECLARE_EVENT_CLASS(writeback_single_inode_template,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write
+	),
+
+	TP_ARGS(inode, wbc, nr_to_write),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, age)
+		__field(unsigned long, writeback_index)
+		__field(long, nr_to_write)
+		__field(unsigned long, wrote)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->age		= (jiffies - inode->dirtied_when) *
+								1000 / HZ;
+		__entry->writeback_index = inode->i_mapping->writeback_index;
+		__entry->nr_to_write	= nr_to_write;
+		__entry->wrote		= nr_to_write - wbc->nr_to_write;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+		  "index=%lu to_write=%ld wrote=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->age,
+		  __entry->writeback_index,
+		  __entry->nr_to_write,
+		  __entry->wrote
+	)
+);
+
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_requeue,
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write),
+	TP_ARGS(inode, wbc, nr_to_write)
+);
+
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write),
+	TP_ARGS(inode, wbc, nr_to_write)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:25.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:26.000000000 +0800
@@ -346,6 +346,7 @@ writeback_single_inode(struct inode *ino
 		       struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -368,6 +369,8 @@ writeback_single_inode(struct inode *ino
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
 			requeue_io(inode, wb);
+			trace_writeback_single_inode_requeue(inode, wbc,
+							     nr_to_write);
 			return 0;
 		}
 
@@ -467,6 +470,7 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+	trace_writeback_single_inode(inode, wbc, nr_to_write);
 	return ret;
 }
 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 16/18] writeback: trace event writeback_queue_io
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (14 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 15/18] writeback: trace event writeback_single_inode Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 17/18] writeback: make writeback_control.nr_to_write straight Wu Fengguang
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-trace-queue_io.patch --]
[-- Type: text/plain, Size: 3503 bytes --]

Note that it adds a little overhead to account for the inodes
moved/enqueued from b_dirty to b_io. The "moved" count may later be used
to limit the number of inodes that can be moved in one shot, in order to
keep the spinlock hold time under control.
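
A purely hypothetical sketch of that future use; MAX_MOVE is made up
here, and nothing in this patch implements it:

        /* hypothetical: bound the work done under wb->list_lock per pass */
        #define MAX_MOVE        1024    /* made-up cap, for illustration */

        while (!list_empty(delaying_queue)) {
                if (moved >= MAX_MOVE)
                        break;  /* leave the rest for the next queue_io() */
                ...
                moved++;
        }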

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   14 ++++++++++----
 include/trace/events/writeback.h |   25 +++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 4 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:26.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:27.000000000 +0800
@@ -248,15 +248,16 @@ static bool inode_dirtied_after(struct i
 /*
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
-static void move_expired_inodes(struct list_head *delaying_queue,
+static int move_expired_inodes(struct list_head *delaying_queue,
 			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+			       unsigned long *older_than_this)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
+	int moved = 0;
 
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
@@ -267,12 +268,13 @@ static void move_expired_inodes(struct l
 			do_sb_sort = 1;
 		sb = inode->i_sb;
 		list_move(&inode->i_wb_list, &tmp);
+		moved++;
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
 	if (!do_sb_sort) {
 		list_splice(&tmp, dispatch_queue);
-		return;
+		goto out;
 	}
 
 	/* Move inodes from one superblock together */
@@ -284,6 +286,8 @@ static void move_expired_inodes(struct l
 				list_move(&inode->i_wb_list, dispatch_queue);
 		}
 	}
+out:
+	return moved;
 }
 
 /*
@@ -299,9 +303,11 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
+	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	trace_writeback_queue_io(wb, older_than_this, moved);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
--- linux-next.orig/include/trace/events/writeback.h	2011-05-24 11:17:26.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-24 11:17:27.000000000 +0800
@@ -162,6 +162,31 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt
 DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_queue_io,
+	TP_PROTO(struct bdi_writeback *wb,
+		 unsigned long *older_than_this,
+		 int moved),
+	TP_ARGS(wb, older_than_this, moved),
+	TP_STRUCT__entry(
+		__array(char,		name, 32)
+		__field(unsigned long,	older)
+		__field(long,		age)
+		__field(int,		moved)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+		__entry->older	= older_than_this ?  *older_than_this : 0;
+		__entry->age	= older_than_this ?
+				  (jiffies - *older_than_this) * 1000 / HZ : -1;
+		__entry->moved	= moved;
+	),
+	TP_printk("bdi %s: older=%lu age=%ld enqueue=%d",
+		__entry->name,
+		__entry->older,	/* older_than_this in jiffies */
+		__entry->age,	/* older_than_this in relative milliseconds */
+		__entry->moved)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 17/18] writeback: make writeback_control.nr_to_write straight
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (15 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 16/18] writeback: trace event writeback_queue_io Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-24  5:14 ` [PATCH 18/18] writeback: rearrange the wb_writeback() loop Wu Fengguang
  2011-05-29  7:34   ` Sedat Dilek
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-pass-work-to-writeback_sb_inodes.patch --]
[-- Type: text/plain, Size: 18301 bytes --]

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

This immediately cleans things up: e.g. wbc.nr_to_write vs work->nr_pages
suddenly starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes() it can always start from a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size (MAX_WRITEBACK_PAGES = 1024
pages of 4KB each), rather than whatever quota remains in wbc->nr_to_write.
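
For reference, the chunk sizing boils down to (condensed from the
writeback_chunk_size() hunk below):

        static long writeback_chunk_size(struct wb_writeback_work *work)
        {
                /* sync and tagged writeback do their own livelock
                 * avoidance, so take the whole file in one go */
                if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
                        return LONG_MAX;
                /* otherwise a full 1024-page (4MB) chunk, capped by
                 * the remaining work */
                return min(MAX_WRITEBACK_PAGES, work->nr_pages);
        }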

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 
 fs/fs-writeback.c                |  199 ++++++++++++++++-------------
 include/linux/writeback.h        |    6 
 include/trace/events/writeback.h |   39 ++++-
 mm/backing-dev.c                 |    9 -
 mm/page-writeback.c              |   17 --
 6 files changed, 147 insertions(+), 125 deletions(-)

change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
  to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_cleaned
- remove wbc_writeback_* trace events for now

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:28.000000000 +0800
@@ -30,11 +30,21 @@
 #include "internal.h"
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024L
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -472,7 +482,6 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
-			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -506,6 +515,31 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+	long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_sb_inodes()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
+		pages = LONG_MAX;
+	else
+		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+	return pages;
+}
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -513,18 +547,30 @@ static bool pin_sb_for_writeback(struct 
  * inodes. Otherwise write only ones which go sequentially
  * in reverse order.
  *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes cleaned.
  */
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
-		struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
 {
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	unsigned long start_time = jiffies;
+	long write_chunk;
+	long wrote = 0;  /* count both pages and inodes */
+
 	while (!list_empty(&wb->b_io)) {
-		long pages_skipped;
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
 		if (inode->i_sb != sb) {
-			if (only_this_sb) {
+			if (work->sb) {
 				/*
 				 * We only want to write back data for this
 				 * superblock, move all inodes not belonging
@@ -539,7 +585,7 @@ static int writeback_sb_inodes(struct su
 			 * Bounce back to the caller to unpin this and
 			 * pin the next superblock.
 			 */
-			return 0;
+			break;
 		}
 
 		/*
@@ -553,12 +599,18 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode, wb);
 			continue;
 		}
-
 		__iget(inode);
+		write_chunk = writeback_chunk_size(work);
+		wbc.nr_to_write = write_chunk;
+		wbc.pages_skipped = 0;
+
+		writeback_single_inode(inode, wb, &wbc);
 
-		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wb, wbc);
-		if (wbc->pages_skipped != pages_skipped) {
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
+		if (!(inode->i_state & I_DIRTY))
+			wrote++;
+		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
@@ -570,17 +622,25 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0)
-			return 1;
+		/*
+		 * bail out to wb_writeback() often enough to check
+		 * background threshold and other termination conditions.
+		 */
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
-	/* b_io is empty */
-	return 1;
+	return wrote;
 }
 
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
-				  struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
-	int ret = 0;
+	unsigned long start_time = jiffies;
+	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -590,33 +650,36 @@ static void __writeback_inodes_wb(struct
 			requeue_io(inode, wb);
 			continue;
 		}
-		ret = writeback_sb_inodes(sb, wb, wbc, false);
+		wrote += writeback_sb_inodes(sb, wb, work);
 		drop_super(sb);
 
-		if (ret)
-			break;
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
 	/* Leave any unwritten inodes on b_io */
+	return wrote;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
 {
+	struct wb_writeback_work work = {
+		.nr_pages	= nr_pages,
+		.sync_mode	= WB_SYNC_NONE,
+		.range_cyclic	= 1,
+	};
+
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
-	__writeback_inodes_wb(wb, wbc);
+		queue_io(wb, NULL);
+	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
-}
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+	return nr_pages - work.nr_pages;
+}
 
 static inline bool over_bground_thresh(void)
 {
@@ -646,42 +709,13 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_writepages	= work->tagged_writepages,
-		.older_than_this	= NULL,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.range_cyclic		= work->range_cyclic,
-	};
+	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	long wrote = 0;
-	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
-
-	if (!wbc.range_cyclic) {
-		wbc.range_start = 0;
-		wbc.range_end = LLONG_MAX;
-	}
-
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          writeback_sb_inodes()       <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
-		write_chunk = LONG_MAX;
+	long progress;
 
 	oldest_jif = jiffies;
-	wbc.older_than_this = &oldest_jif;
+	work->older_than_this = &oldest_jif;
 
 	spin_lock(&wb->list_lock);
 	for (;;) {
@@ -711,25 +745,18 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			wbc.older_than_this = &oldest_jif;
+			work->older_than_this = &oldest_jif;
 		}
 
-		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
-		wbc.inodes_cleaned = 0;
-
 retry:
-		trace_wbc_writeback_start(&wbc, wb->bdi);
+		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, wbc.older_than_this);
+			queue_io(wb, work->older_than_this);
 		if (work->sb)
-			writeback_sb_inodes(work->sb, wb, &wbc, true);
+			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-			__writeback_inodes_wb(wb, &wbc);
-		trace_wbc_writeback_written(&wbc, wb->bdi);
-
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
+			progress = __writeback_inodes_wb(wb, work);
+		trace_writeback_written(wb->bdi, work);
 
 		/*
 		 * Did we write something? Try for more
@@ -739,9 +766,7 @@ retry:
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		if (wbc.inodes_cleaned)
+		if (progress)
 			continue;
 		/*
 		 * background writeback will start with expired inodes, and
@@ -750,10 +775,10 @@ retry:
 		 * lists and cause trouble to the page reclaim.
 		 */
 		if (work->for_background &&
-		    wbc.older_than_this &&
+		    work->older_than_this &&
 		    list_empty(&wb->b_io) &&
 		    list_empty(&wb->b_more_io)) {
-			wbc.older_than_this = NULL;
+			work->older_than_this = NULL;
 			goto retry;
 		}
 		/*
@@ -767,8 +792,8 @@ retry:
 		 * we'll just busyloop.
 		 */
 		if (!list_empty(&wb->b_more_io))  {
+			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
-			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
@@ -776,7 +801,7 @@ retry:
 	}
 	spin_unlock(&wb->list_lock);
 
-	return wrote;
+	return nr_pages - work->nr_pages;
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-05-24 11:17:26.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-05-24 11:17:28.000000000 +0800
@@ -24,12 +24,9 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
-	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
@@ -56,8 +53,7 @@ void writeback_inodes_sb_nr(struct super
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
 
--- linux-next.orig/mm/backing-dev.c	2011-05-24 11:17:20.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-05-24 11:17:28.000000000 +0800
@@ -262,14 +262,7 @@ int bdi_has_dirty_io(struct backing_dev_
 
 static void bdi_flush_io(struct backing_dev_info *bdi)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
-		.range_cyclic		= 1,
-		.nr_to_write		= 1024,
-	};
-
-	writeback_inodes_wb(&bdi->wb, &wbc);
+	writeback_inodes_wb(&bdi->wb, 1024);
 }
 
 /*
--- linux-next.orig/mm/page-writeback.c	2011-05-24 11:17:24.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-24 11:17:28.000000000 +0800
@@ -494,13 +494,6 @@ static void balance_dirty_pages(struct a
 		return;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -562,17 +555,17 @@ static void balance_dirty_pages(struct a
 		 * threshold otherwise wait until the disk writes catch
 		 * up.
 		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
+		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
+			pages_written += writeback_inodes_wb(&bdi->wb,
+							     write_chunk);
+			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		trace_balance_dirty_wait(bdi);
 
 		/*
 		 * Increase the delay for each loop, up to our previous
--- linux-next.orig/include/trace/events/writeback.h	2011-05-24 11:17:27.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-05-24 11:17:28.000000000 +0800
@@ -62,6 +62,9 @@ DEFINE_EVENT(writeback_work_class, name,
 DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
@@ -101,6 +104,30 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
+DEFINE_WRITEBACK_EVENT(balance_dirty_start);
+DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
+
+TRACE_EVENT(balance_dirty_written,
+
+	TP_PROTO(struct backing_dev_info *bdi, int written),
+
+	TP_ARGS(bdi, written),
+
+	TP_STRUCT__entry(
+		__array(char,	name, 32)
+		__field(int,	written)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->written = written;
+	),
+
+	TP_printk("bdi %s written %d",
+		  __entry->name,
+		  __entry->written
+	)
+);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
@@ -114,7 +141,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -128,14 +154,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -145,7 +169,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
@@ -154,12 +177,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 DEFINE_EVENT(wbc_class, name, \
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
 	TP_ARGS(wbc, bdi))
-DEFINE_WBC_EVENT(wbc_writeback_start);
-DEFINE_WBC_EVENT(wbc_writeback_written);
-DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
 TRACE_EVENT(writeback_queue_io,
--- linux-next.orig/fs/btrfs/extent_io.c	2011-05-24 11:17:11.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-05-24 11:17:28.000000000 +0800
@@ -2596,7 +2596,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2629,7 +2628,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 18/18] writeback: rearrange the wb_writeback() loop
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
                   ` (16 preceding siblings ...)
  2011-05-24  5:14 ` [PATCH 17/18] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-05-24  5:14 ` Wu Fengguang
  2011-05-29  7:34   ` Sedat Dilek
  18 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-24  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Wu Fengguang, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML

[-- Attachment #1: loop-cleanup --]
[-- Type: text/plain, Size: 2853 bytes --]

Move the termination conditions to the end of the loop, and move the
work->older_than_this updates close to each other.
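
A condensed view of the rearranged loop (a sketch in comment form; the
real control flow is in the diff below):

        /*
         *      spin_lock(&wb->list_lock);
         *      for (;;) {
         *              pick work->older_than_this: expired-only for
         *                  kupdate/background, or NULL on a bg retry;
         *              refill b_io if empty, write some inodes;
         *              if (progress)
         *                      continue;
         *              if (a background pass found nothing expired)
         *                      retry with all dirty inodes;
         *              bail-out checks, now at the bottom: b_more_io
         *                  empty, nr_pages consumed, below background
         *                  thresh, or other work queued;
         *              otherwise wait for an inode under writeback;
         *      }
         *      spin_unlock(&wb->list_lock);
         */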

Proposed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   58 +++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 27 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:28.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:30.000000000 +0800
@@ -713,42 +713,22 @@ static long wb_writeback(struct bdi_writ
 	unsigned long oldest_jif;
 	struct inode *inode;
 	long progress;
+	bool bg_retry_all = false;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
 
 	spin_lock(&wb->list_lock);
 	for (;;) {
-		/*
-		 * Stop writeback when nr_pages has been consumed
-		 */
-		if (work->nr_pages <= 0)
-			break;
-
-		/*
-		 * Background writeout and kupdate-style writeback may
-		 * run forever. Stop them if there is other work to do
-		 * so that e.g. sync can proceed. They'll be restarted
-		 * after the other works are all done.
-		 */
-		if ((work->for_background || work->for_kupdate) &&
-		    !list_empty(&wb->bdi->work_list))
-			break;
-
-		/*
-		 * For background writeout, stop when we are below the
-		 * background dirty threshold
-		 */
-		if (work->for_background && !over_bground_thresh())
-			break;
-
-		if (work->for_kupdate || work->for_background) {
+		if (bg_retry_all) {
+			bg_retry_all = false;
+			work->older_than_this = NULL;
+		} else if (work->for_kupdate || work->for_background) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
 			work->older_than_this = &oldest_jif;
 		}
 
-retry:
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
 			queue_io(wb, work->older_than_this);
@@ -778,14 +758,38 @@ retry:
 		    work->older_than_this &&
 		    list_empty(&wb->b_io) &&
 		    list_empty(&wb->b_more_io)) {
-			work->older_than_this = NULL;
-			goto retry;
+			bg_retry_all = true;
+			continue;
 		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
+
+		/*
+		 * Stop writeback when nr_pages has been consumed
+		 */
+		if (work->nr_pages <= 0)
+			break;
+
+		/*
+		 * For background writeout, stop when we are below the
+		 * background dirty threshold
+		 */
+		if (work->for_background && !over_bground_thresh())
+			break;
+
+		/*
+		 * Background writeout and kupdate-style writeback may
+		 * run forever. Stop them if there is other work to do
+		 * so that e.g. sync can proceed. They'll be restarted
+		 * after the other works are all done.
+		 */
+		if ((work->for_background || work->for_kupdate) &&
+		    !list_empty(&wb->bdi->work_list))
+			break;
+
 		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/18] writeback: sync expired inodes first in background writeback
  2011-05-24  5:14 ` [PATCH 06/18] writeback: sync expired inodes first in background writeback Wu Fengguang
@ 2011-05-24 15:52   ` Jan Kara
  2011-05-25 14:38     ` Wu Fengguang
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2011-05-24 15:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Dave Chinner, Rik van Riel, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, LKML

On Tue 24-05-11 13:14:17, Wu Fengguang wrote:
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> At each queue_io() time, first try enqueuing only newly expired inodes.
> If there are zero expired inodes to work with, then relax the rule and
> enqueue all dirty inodes.
  Fengguang, I've been thinking about this change again (since the code is
now easier to read - good work! - I realized some new consequences) and I
was wondering: Assume there is one continuously redirtied file and an
untar starts in parallel. With the new logic, background writeback will
never consider inodes that are not expired in this situation (we never
switch to the "all dirty inodes" phase - or even if we switched, we would
just queue all inodes and then return to queueing only expired inodes). So
the net effect is that for 30 seconds we will only be continuously writing
pages of the continuously dirtied file instead of (possibly older) pages of
other files that are written. Is this really desirable? Wasn't the old
behavior simpler and no worse than the new one?

								Honza

> --- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
> @@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
>  		if (work->for_background && !over_bground_thresh())
>  			break;
>  
> -		if (work->for_kupdate) {
> +		if (work->for_kupdate || work->for_background) {
>  			oldest_jif = jiffies -
>  				msecs_to_jiffies(dirty_expire_interval * 10);
>  			wbc.older_than_this = &oldest_jif;
> @@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
>  		wbc.pages_skipped = 0;
>  		wbc.inodes_cleaned = 0;
>  
> +retry:
>  		trace_wbc_writeback_start(&wbc, wb->bdi);
>  		if (work->sb)
>  			__writeback_inodes_sb(work->sb, wb, &wbc);
> @@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
>  		if (wbc.inodes_cleaned)
>  			continue;
>  		/*
> +		 * background writeback will start with expired inodes, and
> +		 * if none is found, fallback to all inodes. This order helps
> +		 * reduce the number of dirty pages reaching the end of LRU
> +		 * lists and cause trouble to the page reclaim.
> +		 */
> +		if (work->for_background &&
> +		    wbc.older_than_this &&
> +		    list_empty(&wb->b_io) &&
> +		    list_empty(&wb->b_more_io)) {
> +			wbc.older_than_this = NULL;
> +			goto retry;
> +		}
> +		/*
>  		 * No more inodes for IO, bail
>  		 */
>  		if (!wbc.more_io)
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/18] writeback: sync expired inodes first in background writeback
  2011-05-24 15:52   ` Jan Kara
@ 2011-05-25 14:38     ` Wu Fengguang
  2011-05-26 23:10       ` Jan Kara
  2011-05-27 15:17       ` Wu Fengguang
  0 siblings, 2 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-25 14:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Dave Chinner, Rik van Riel, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, LKML

Kara,

On Tue, May 24, 2011 at 11:52:05PM +0800, Jan Kara wrote:
> On Tue 24-05-11 13:14:17, Wu Fengguang wrote:
> > A background flush work may run forever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> > 
> > At each queue_io() time, first try enqueuing only newly expired inodes.
> > If there are zero expired inodes to work with, then relax the rule and
> > enqueue all dirty inodes.
>   Fengguang, I've been thinking about this change again (since the code is
> now easier to read - good work! - I realized some new consequences)

Thank you.

> and I was wondering: Assume there is one continuously redirtied file and
> untar starts in parallel. With the new logic, background writeback will
> never consider inodes that are not expired in this situation (we never
> switch to "all dirty inodes" phase - or even if we switched, we would just
> queue all inodes and then return back to queueing only expired inodes). So
> the net effect is that for 30 seconds we will be only continuously writing
> pages of the continuously dirtied file instead of (possibly older) pages of
> other files that are written. Is this really desirable? Wasn't the old
> behavior simpler and not worse than the new one?

Good question! Yes, sadly, in this case the new behavior could be worse
than the old one.

In fact this patch does not improve the small file (< 4MB) case at all,
except for the side effect that fewer unexpired inodes will be left in
s_io when the background work quits, so the later kupdate work will
write fewer unexpired inodes.

And for the mixed small/large file case, it actually results in worse
behavior in the case you mentioned.

However, the root cause here is that the file is being _actively_
written to, which is effectively a livelock. We could add a simple
livelock prevention scheme that works for the common case of file
appending:

- save i_size when the range_cyclic writeback starts from 0, for
  limiting the writeback scope
  
- when range_cyclic writeback hits the saved i_size, quit the current
  inode instead of immediately restarting from 0. This will not only
  avoid a possible extra seek, but also redirty_tail() the inode and
  hence get out of a possible livelock (see the sketch below).
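
For concreteness, a minimal sketch of the idea as it might look inside
write_cache_pages(). This is purely illustrative: no posted patch
contains it, and a variant that persists across calls would need
per-inode storage.

        /* range_cyclic case: snapshot i_size when the scan starts
         * from index 0 */
        loff_t saved_isize = 0;

        if (wbc->range_cyclic && mapping->writeback_index == 0)
                saved_isize = i_size_read(mapping->host);

        /* ... later, in the page scan loop ... */
        if (saved_isize &&
            ((loff_t)page->index << PAGE_CACHE_SHIFT) >= saved_isize) {
                /*
                 * Hit the i_size snapshot: stop this inode instead
                 * of wrapping back to 0, so the caller can
                 * redirty_tail() it and move on to other inodes.
                 */
                done = 1;
                break;
        }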

The livelock prevention scheme may not only eliminate the undesirable
behavior you observed for this patch, but also prevent the "some old
pages may not get the chance to get written to disk in an actively
dirtied file" data security issue discussed in an old email. What do
you think?

Thanks,
Fengguang

> > --- linux-next.orig/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2011-05-24 11:17:18.000000000 +0800
> > @@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
> >  		if (work->for_background && !over_bground_thresh())
> >  			break;
> >  
> > -		if (work->for_kupdate) {
> > +		if (work->for_kupdate || work->for_background) {
> >  			oldest_jif = jiffies -
> >  				msecs_to_jiffies(dirty_expire_interval * 10);
> >  			wbc.older_than_this = &oldest_jif;
> > @@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
> >  		wbc.pages_skipped = 0;
> >  		wbc.inodes_cleaned = 0;
> >  
> > +retry:
> >  		trace_wbc_writeback_start(&wbc, wb->bdi);
> >  		if (work->sb)
> >  			__writeback_inodes_sb(work->sb, wb, &wbc);
> > @@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
> >  		if (wbc.inodes_cleaned)
> >  			continue;
> >  		/*
> > +		 * background writeback will start with expired inodes, and
> > +		 * if none is found, fallback to all inodes. This order helps
> > +		 * reduce the number of dirty pages reaching the end of LRU
> > +		 * lists and cause trouble to the page reclaim.
> > +		 */
> > +		if (work->for_background &&
> > +		    wbc.older_than_this &&
> > +		    list_empty(&wb->b_io) &&
> > +		    list_empty(&wb->b_more_io)) {
> > +			wbc.older_than_this = NULL;
> > +			goto retry;
> > +		}
> > +		/*
> >  		 * No more inodes for IO, bail
> >  		 */
> >  		if (!wbc.more_io)
> > 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/18] writeback: sync expired inodes first in background writeback
  2011-05-25 14:38     ` Wu Fengguang
@ 2011-05-26 23:10       ` Jan Kara
  2011-05-27 15:06         ` Wu Fengguang
  2011-05-27 15:17       ` Wu Fengguang
  1 sibling, 1 reply; 27+ messages in thread
From: Jan Kara @ 2011-05-26 23:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Dave Chinner, Rik van Riel, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed 25-05-11 22:38:57, Wu Fengguang wrote:
> > and I was wondering: Assume there is one continuously redirtied file and
> > untar starts in parallel. With the new logic, background writeback will
> > never consider inodes that are not expired in this situation (we never
> > switch to "all dirty inodes" phase - or even if we switched, we would just
> > queue all inodes and then return back to queueing only expired inodes). So
> > the net effect is that for 30 seconds we will be only continuously writing
> > pages of the continuously dirtied file instead of (possibly older) pages of
> > other files that are written. Is this really desirable? Wasn't the old
> > behavior simpler and not worse than the new one?
> 
> Good question! Yes, sadly, in this case the new behavior could be worse
> than the old one.
> 
> In fact this patch does not improve the small file (< 4MB) case at all,
> except for the side effect that fewer unexpired inodes will be left in
> s_io when the background work quits, so the later kupdate work will
> write fewer unexpired inodes.
> 
> And for the mixed small/large file case, it actually results in worse
> behavior in the case you mentioned.
> 
> However, the root cause here is that the file is being _actively_
> written to, which is effectively a livelock. We could add a simple
> livelock prevention scheme that works for the common case of file
> appending:
> 
> - save i_size when the range_cyclic writeback starts from 0, for
>   limiting the writeback scope
  Hmm, but for this we'd have to store an additional 'unsigned long' (page
index) for each inode. Not sure if it's really worth it.

> - when range_cyclic writeback hits the saved i_size, quit the current
>   inode instead of immediately restarting from 0. This will not only
>   avoid a possible extra seek, but also redirty_tail() the inode and
>   hence get out of a possible livelock.
  But I like the idea of doing redirty_tail() when we write out some inode
for too long. Maybe we could just do redirty_tail() instead of requeue_io()
whenever write_cache_pages() had to wrap the index? We could communicate
this by setting a flag in wbc in write_cache_pages()...
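
Sketched roughly below; the wrapped flag is invented for illustration
(nothing like it exists in the posted series) and the function
signatures are simplified:

        /* write_cache_pages(): note that the cyclic scan wrapped */
        if (!cycled && !done) {
                cycled = 1;
                index = 0;
                end = writeback_index - 1;
                wbc->wrapped = 1;       /* invented flag */
                goto retry;
        }

        /* the caller (e.g. writeback_single_inode()) acts on it */
        if (wbc->wrapped)
                redirty_tail(inode);    /* wrapped: give others a turn */
        else
                requeue_io(inode);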

> The livelock prevention scheme may not only eliminate the undesirable
> behavior you observed for this patch, but also prevent the "some old
> pages may not get the chance to get written to disk in an actively
> dirtied file" data security issue discussed in an old email. What do
> you think?
  So my scheme would not solve this but it does not require per-inode
overhead...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/18] writeback: sync expired inodes first in background writeback
  2011-05-26 23:10       ` Jan Kara
@ 2011-05-27 15:06         ` Wu Fengguang
  0 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-27 15:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Dave Chinner, Rik van Riel, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, LKML

On Fri, May 27, 2011 at 07:10:45AM +0800, Jan Kara wrote:
> On Wed 25-05-11 22:38:57, Wu Fengguang wrote:
> > > and I was wondering: Assume there is one continuously redirtied file and
> > > untar starts in parallel. With the new logic, background writeback will
> > > never consider inodes that are not expired in this situation (we never
> > > switch to "all dirty inodes" phase - or even if we switched, we would just
> > > queue all inodes and then return back to queueing only expired inodes). So
> > > the net effect is that for 30 seconds we will be only continuously writing
> > > pages of the continuously dirtied file instead of (possibly older) pages of
> > > other files that are written. Is this really desirable? Wasn't the old
> > > behavior simpler and not worse than the new one?
> > 
> > Good question! Yes, sadly, in this case the new behavior could be worse
> > than the old one.
> > 
> > In fact this patch does not improve the small file (< 4MB) case at all,
> > except for the side effect that fewer unexpired inodes will be left in
> > s_io when the background work quits, so the later kupdate work will
> > write fewer unexpired inodes.
> > 
> > And for the mixed small/large file case, it actually results in worse
> > behavior in the case you mentioned.
> > 
> > However, the root cause here is that the file is being _actively_
> > written to, which is effectively a livelock. We could add a simple
> > livelock prevention scheme that works for the common case of file
> > appending:
> > 
> > - save i_size when the range_cyclic writeback starts from 0, for
> >   limiting the writeback scope
>   Hmm, but for this we'd have to store an additional 'unsigned long' (page
> index) for each inode. Not sure if it's really worth it.

Yeah, it may be a considerable space cost when the icache grows large.

> > - when range_cyclic writeback hits the saved i_size, quit the current
> >   inode instead of immediately restarting from 0. This will not only
> >   avoid a possible extra seek, but also redirty_tail() the inode and
> >   hence get out of a possible livelock.
>   But I like the idea of doing redirty_tail() when we write out some inode
> for too long.

Then there is the hard question of "how long is long enough to
redirty_tail()?". The time should be long enough to avoid livelocking
the dirty pages near EOF when a large file is being written to all over
the place (random writes or multiple sequential write streams), and yet
still be small enough to be useful.

> Maybe we could just do redirty_tail() instead of requeue_io()
> whenever write_cache_pages() had to wrap the index? We could communicate
> this by setting a flag in wbc in write_cache_pages()...

That's the minimum required for the problem here. What do you think
about this old patch, which happens to contain the required side effects?

        writeback: quit on wrap for .range_cyclic (ext4 part)
        http://lkml.org/lkml/2009/10/7/129

> > The livelock prevention scheme may not only eliminate the undesirable
> > behavior you observed for this patch, but also prevent the "some old
> > pages may not get the chance to get written to disk in an actively
> > dirtied file" data security issue discussed in an old email. What do
> > you think?
>   So my scheme would not solve this but it does not require per-inode
> overhead...

Another possible scheme is to tag the dirty inode's pages immediately
after requeue_io(), if they are not yet tagged.  Accordingly, let
write_cache_pages() write only the tagged pages as long as
mapping_tagged(TOWRITE) holds, and always update dirtied_when on
writeback index wrap (see the sketch below).
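
Roughly like this; requeue_io_tagged() is an invented helper and the
real interplay with requeue_io() would need more care:

        static void requeue_io_tagged(struct inode *inode)
        {
                struct address_space *mapping = inode->i_mapping;

                /* tag the currently dirty pages once, so later passes
                 * only revisit this snapshot of the file */
                if (!mapping_tagged(mapping, PAGECACHE_TAG_TOWRITE))
                        tag_pages_for_writeback(mapping, 0, (pgoff_t)-1);
                requeue_io(inode);
        }

        /*
         * write_cache_pages() would then scan PAGECACHE_TAG_TOWRITE
         * instead of PAGECACHE_TAG_DIRTY while the mapping stays
         * tagged, much as .tagged_writepages already does for sync.
         */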

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 06/18] writeback: sync expired inodes first in background writeback
  2011-05-25 14:38     ` Wu Fengguang
  2011-05-26 23:10       ` Jan Kara
@ 2011-05-27 15:17       ` Wu Fengguang
  1 sibling, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-27 15:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Rik van Riel, Mel Gorman,
	Christoph Hellwig, linux-fsdevel, LKML

On Wed, May 25, 2011 at 10:38:57PM +0800, Wu Fengguang wrote:
> On Tue, May 24, 2011 at 11:52:05PM +0800, Jan Kara wrote:
[snip]
> > and I was wondering: Assume there is one continuously redirtied file and
> > untar starts in parallel. With the new logic, background writeback will
> > never consider inodes that are not expired in this situation (we never
> > switch to "all dirty inodes" phase - or even if we switched, we would just
> > queue all inodes and then return back to queueing only expired inodes). So
> > the net effect is that for 30 seconds we will be only continuously writing
> > pages of the continuously dirtied file instead of (possibly older) pages of
> > other files that are written. Is this really desirable? Wasn't the old
> > behavior simpler and not worse than the new one?
> 
> Good question! Yes, sadly, in this case the new behavior could be worse
> than the old one.

Andrew, it's desirable to delay this patch, given that it has some
negative behavior (it's fixable, but the fix won't be trivial enough
for immediate merge).

You may simply drop this patch (06) and the last two patches (17, 18)
and still be able to cleanly apply all other patches in this series.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4)
  2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
@ 2011-05-29  7:34   ` Sedat Dilek
  2011-05-24  5:14 ` [PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
                     ` (17 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Sedat Dilek @ 2011-05-29  7:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, LKML, Linus Torvalds

(Sorry for top-posting.)
I have had this patchset (v4) in my series since 24 May.
Is this going to be accepted within the merge window for 2.6.40 (or 3.0)?

- Sedat -

On Tue, May 24, 2011 at 7:14 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> Andrew,
>
> The first 10 patches are already in -mm tree, with updates:
>
> - remove "writeback: pass writeback_control down to move_expired_inodes()", and
>  resolve the resulting merge conflicts in other patches.
> - move ahead the sync livelock prevention patches (01, 02) so that (04) won't livelock sync
> - merge the three -mm fixes to (08)
> - fixed changelog of (01)
> - rename .for_sync to .tagged_writepages
>
>        [PATCH 01/18] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
>        [PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock
>        [PATCH 03/18] writeback: introduce writeback_control.inodes_cleaned
>        [PATCH 04/18] writeback: try more writeback as long as something was written
>        [PATCH 05/18] writeback: the kupdate expire timestamp should be a moving target
>        [PATCH 06/18] writeback: sync expired inodes first in background writeback
>        [PATCH 07/18] writeback: refill b_io iff empty
>        [PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
>        [PATCH 09/18] writeback: elevate queue_io() into wb_writeback()
>        [PATCH 10/18] writeback: avoid extra sync work at enqueue time
>
> The following 6 patches are trivial and safe
>
>        [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc
>        [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs
>        [PATCH 13/18] writeback: remove writeback_control.more_io
>        [PATCH 14/18] writeback: remove .nonblocking and .encountered_congestion
>        [PATCH 15/18] writeback: trace event writeback_single_inode
>        [PATCH 16/18] writeback: trace event writeback_queue_io
>
> Patch 17 is a bit more complex
>
>        [PATCH 17/18] writeback: make writeback_control.nr_to_write straight
>        [PATCH 18/18] writeback: rearrange the wb_writeback() loop
>
> Thanks to Jan and Dave for the careful reviews!
>
> The patches are git pullable from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback
>
>  fs/block_dev.c                   |   16 +-
>  fs/btrfs/extent_io.c             |    2 -
>  fs/ext4/inode.c                  |    4 +-
>  fs/fs-writeback.c                |  411 +++++++++++++++++++++-----------------
>  fs/inode.c                       |    5 +-
>  fs/nfs/write.c                   |    3 +-
>  fs/xfs/linux-2.6/xfs_aops.c      |    2 +-
>  include/linux/backing-dev.h      |    2 +
>  include/linux/writeback.h        |   14 +-
>  include/trace/events/btrfs.h     |    6 +-
>  include/trace/events/ext4.h      |    6 +-
>  include/trace/events/writeback.h |  137 ++++++++++++--
>  mm/backing-dev.c                 |   30 ++-
>  mm/filemap.c                     |    6 +-
>  mm/page-writeback.c              |   42 ++--
>  mm/rmap.c                        |    4 +-
>  16 files changed, 420 insertions(+), 270 deletions(-)
>
> Thanks,
> Fengguang
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc
  2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
@ 2011-05-19 21:45 ` Wu Fengguang
  0 siblings, 0 replies; 27+ messages in thread
From: Wu Fengguang @ 2011-05-19 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, LKML

[-- Attachment #1: writeback-task_dirty_limit-comment.patch --]
[-- Type: text/plain, Size: 1193 bytes --]

Clarify the bdi_dirty_limit() comment.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-05-20 05:08:25.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-05-20 05:12:43.000000000 +0800
@@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-/*
+/**
  * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ *
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * And the "limit" in the name is not seriously taken as hard limit in
+ * balance_dirty_pages().
  *
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
  * - piling up dirty pages (that will take long time to sync) on slow devices
  *
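
For context, the global and per-bdi limits are combined at the call
site roughly as follows (an illustrative sketch, not a verbatim excerpt
of balance_dirty_pages()):

        unsigned long background_thresh, dirty_thresh, bdi_thresh;

        global_dirty_limits(&background_thresh, &dirty_thresh);
        bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);

        /*
         * The caller then compares the bdi's reclaimable + writeback
         * page counts against bdi_thresh (and the global counts
         * against dirty_thresh); neither is treated as a hard limit.
         */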



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2011-05-29  7:34 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-24  5:14 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Wu Fengguang
2011-05-24  5:14 ` [PATCH 01/18] writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage Wu Fengguang
2011-05-24  5:14 ` [PATCH 02/18] writeback: update dirtied_when for synced inode to prevent livelock Wu Fengguang
2011-05-24  5:14 ` [PATCH 03/18] writeback: introduce writeback_control.inodes_cleaned Wu Fengguang
2011-05-24  5:14 ` [PATCH 04/18] writeback: try more writeback as long as something was written Wu Fengguang
2011-05-24  5:14 ` [PATCH 05/18] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
2011-05-24  5:14 ` [PATCH 06/18] writeback: sync expired inodes first in background writeback Wu Fengguang
2011-05-24 15:52   ` Jan Kara
2011-05-25 14:38     ` Wu Fengguang
2011-05-26 23:10       ` Jan Kara
2011-05-27 15:06         ` Wu Fengguang
2011-05-27 15:17       ` Wu Fengguang
2011-05-24  5:14 ` [PATCH 07/18] writeback: refill b_io iff empty Wu Fengguang
2011-05-24  5:14 ` [PATCH 08/18] writeback: split inode_wb_list_lock into bdi_writeback.list_lock Wu Fengguang
2011-05-24  5:14 ` [PATCH 09/18] writeback: elevate queue_io() into wb_writeback() Wu Fengguang
2011-05-24  5:14 ` [PATCH 10/18] writeback: avoid extra sync work at enqueue time Wu Fengguang
2011-05-24  5:14 ` [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
2011-05-24  5:14 ` [PATCH 12/18] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
2011-05-24  5:14 ` [PATCH 13/18] writeback: remove writeback_control.more_io Wu Fengguang
2011-05-24  5:14 ` [PATCH 14/18] writeback: remove .nonblocking and .encountered_congestion Wu Fengguang
2011-05-24  5:14 ` [PATCH 15/18] writeback: trace event writeback_single_inode Wu Fengguang
2011-05-24  5:14 ` [PATCH 16/18] writeback: trace event writeback_queue_io Wu Fengguang
2011-05-24  5:14 ` [PATCH 17/18] writeback: make writeback_control.nr_to_write straight Wu Fengguang
2011-05-24  5:14 ` [PATCH 18/18] writeback: rearrange the wb_writeback() loop Wu Fengguang
2011-05-29  7:34 ` [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v4) Sedat Dilek
  -- strict thread matches above, loose matches on Subject: below --
2011-05-19 21:45 [PATCH 00/18] writeback fixes and cleanups for 2.6.40 (v3) Wu Fengguang
2011-05-19 21:45 ` [PATCH 11/18] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
