* [PATCH 0/6] [RFC] writeback fixes for 2.6.32
@ 2009-09-23 12:33 Wu Fengguang
  2009-09-23 12:33 ` [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 12:33 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Theodore Ts'o, Dave Chinner, Chris Mason,
	Christoph Hellwig, Peter Zijlstra, linux-fsdevel, Wu Fengguang,
	LKML

Hi,

Here are some writeback patches intended for 2.6.32.
They are based on the latest linux-next plus Jan's
busy loop fix.

They are mostly bug fixes. The last patch is a behavior change
that should be optional for 2.6.32, but included for review.

Thanks,
Fengguang



* [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-23 12:33 [PATCH 0/6] [RFC] writeback fixes for 2.6.32 Wu Fengguang
@ 2009-09-23 12:33 ` Wu Fengguang
  2009-09-23 12:45   ` Christoph Hellwig
  2009-09-23 13:56   ` [PATCH 1/6 -v2] " Wu Fengguang
  2009-09-23 12:33 ` [PATCH 2/6] writeback: stop background writeback when below background threshold Wu Fengguang
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 12:33 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Wu Fengguang, Theodore Ts'o, Dave Chinner,
	Chris Mason, Christoph Hellwig, Peter Zijlstra, linux-fsdevel,
	LKML

[-- Attachment #1: writeback-ratelimit.patch --]
[-- Type: text/plain, Size: 2371 bytes --]

Some filesystems may choose to write much more than ratelimit_pages
before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
determine the number of pages to write based on the real number of
dirtied pages.

The increased write_chunk may make the dirtier more bumpy.  It is the
filesystem writers' duty not to dirty too many pages at a time without
checking the ratelimit.
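
To illustrate, here is a sketch of the problem case (fs_dirty_pages()
is a hypothetical helper; the 1024-page batch matches the btrfs
behavior discussed later in this thread):

  /* a filesystem that dirties a large batch of pages per iteration */
  for (;;) {
          fs_dirty_pages(mapping, 1024);  /* dirty 1024 pages at once */
          balance_dirty_pages_ratelimited_nr(mapping, 1024);
          /*
           * old behavior: write at most ratelimit_pages * 3/2 pages
           * (48 with the initial ratelimit_pages = 32), far fewer
           * than the 1024 pages just dirtied
           */
  }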

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-09-23 16:31:46.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-23 16:33:19.000000000 +0800
@@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
+ * It should be somewhat larger than dirtied pages to ensure that reasonably
  * large amounts of I/O are submitted.
  */
-static inline long sync_writeback_pages(void)
+static inline long sync_writeback_pages(unsigned long dirtied)
 {
-	return ratelimit_pages + ratelimit_pages / 2;
+	return dirtied + dirtied / 2;
 }
 
 /* The following parameters are exported via /proc/sys/vm */
@@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping)
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
@@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
-	unsigned long write_chunk = sync_writeback_pages();
 	unsigned long pause = 1;
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -640,9 +640,10 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
+		ratelimit = sync_writeback_pages(*p);
 		*p = 0;
 		preempt_enable();
-		balance_dirty_pages(mapping);
+		balance_dirty_pages(mapping, ratelimit);
 		return;
 	}
 	preempt_enable();




* [PATCH 2/6] writeback: stop background writeback when below background threshold
  2009-09-23 12:33 [PATCH 0/6] [RFC] writeback fixes for 2.6.32 Wu Fengguang
  2009-09-23 12:33 ` [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
@ 2009-09-23 12:33 ` Wu Fengguang
  2009-09-23 15:05   ` Jens Axboe
  2009-09-23 12:33 ` [PATCH 3/6] writeback: kupdate writeback shall not stop when more io is possible Wu Fengguang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 12:33 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Theodore Ts'o,
	Dave Chinner, Chris Mason, Christoph Hellwig, linux-fsdevel,
	LKML

[-- Attachment #1: writeback-background-threshold.patch --]
[-- Type: text/plain, Size: 2964 bytes --]

Treat bdi_start_writeback(0) as a special request to do background
writeback, and stop such work when we are below the background dirty
threshold.

Also simplify the (nr_pages <= 0) checks. Since we already pass in
nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't
need to worry about it being decreased to zero.
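
In sketch form, the resulting calling convention (both forms appear in
the diff below):

  bdi_start_writeback(bdi, 0);         /* background writeback: run until
                                          below the background threshold */
  bdi_start_writeback(bdi, nr_pages);  /* write out nr_pages, then stop */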

Reported-by: Richard Kennedy <richard@rsk.demon.co.uk>
CC: Jan Kara <jack@suse.cz>
CC: Jens Axboe <jens.axboe@oracle.com> 
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c   |   28 +++++++++++++++++-----------
 mm/page-writeback.c |    6 +++---
 2 files changed, 20 insertions(+), 14 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-23 17:47:23.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-23 18:13:36.000000000 +0800
@@ -41,8 +41,9 @@ struct wb_writeback_args {
 	long nr_pages;
 	struct super_block *sb;
 	enum writeback_sync_modes sync_mode;
-	int for_kupdate;
-	int range_cyclic;
+	int for_kupdate:1;
+	int range_cyclic:1;
+	int for_background:1;
 };
 
 /*
@@ -260,6 +261,15 @@ void bdi_start_writeback(struct backing_
 		.range_cyclic	= 1,
 	};
 
+	/*
+	 * We treat @nr_pages=0 as the special case to do background writeback,
+	 * ie. to sync pages until the background dirty threshold is reached.
+	 */
+	if (!nr_pages) {
+		args.nr_pages = LONG_MAX;
+		args.for_background = 1;
+	}
+
 	bdi_alloc_queue_work(bdi, &args);
 }
 
@@ -723,20 +733,16 @@ static long wb_writeback(struct bdi_writ
 
 	for (;;) {
 		/*
-		 * Don't flush anything for non-integrity writeback where
-		 * no nr_pages was given
+		 * Stop writeback when nr_pages has been consumed
 		 */
-		if (!args->for_kupdate && args->nr_pages <= 0 &&
-		     args->sync_mode == WB_SYNC_NONE)
+		if (args->nr_pages <= 0)
 			break;
 
 		/*
-		 * If no specific pages were given and this is just a
-		 * periodic background writeout and we are below the
-		 * background dirty threshold, don't do anything
+		 * For background writeout, stop when we are below the
+		 * background dirty threshold
 		 */
-		if (args->for_kupdate && args->nr_pages <= 0 &&
-		    !over_bground_thresh())
+		if (args->for_background && !over_bground_thresh())
 			break;
 
 		wbc.more_io = 0;
--- linux.orig/mm/page-writeback.c	2009-09-23 17:45:58.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-23 17:47:17.000000000 +0800
@@ -589,10 +589,10 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((nr_writeback = global_page_state(NR_FILE_DIRTY)
-					  + global_page_state(NR_UNSTABLE_NFS))
+	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
+			       + global_page_state(NR_UNSTABLE_NFS))
 					  > background_thresh)))
-		bdi_start_writeback(bdi, nr_writeback);
+		bdi_start_writeback(bdi, 0);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)




* [PATCH 3/6] writeback: kupdate writeback shall not stop when more io is possible
  2009-09-23 12:33 [PATCH 0/6] [RFC] writeback fixes for 2.6.32 Wu Fengguang
  2009-09-23 12:33 ` [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
  2009-09-23 12:33 ` [PATCH 2/6] writeback: stop background writeback when below background threshold Wu Fengguang
@ 2009-09-23 12:33 ` Wu Fengguang
  2009-09-23 12:33 ` [PATCH 4/6] writeback: cleanup writeback_single_inode() Wu Fengguang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 12:33 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Dave Chinner, Peter Zijlstra, Wu Fengguang,
	Theodore Ts'o, Chris Mason, Christoph Hellwig, linux-fsdevel,
	LKML

[-- Attachment #1: writeback-more-io.patch --]
[-- Type: text/plain, Size: 1190 bytes --]

Fix the kupdate case, which disregards wbc.more_io and stops writeback
prematurely even when there are more inodes to be synced.

wbc.more_io should always be respected.

Also remove the pages_skipped check. It will be set when some page(s)
of some inode(s) cannot be written for now. Such inodes will be delayed
for a while. This variable has nothing to do with whether there are
other writeable inodes.
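
A simplified sketch of the resulting exit logic in wb_writeback() (the
actual diff below keeps a MAX_WRITEBACK_PAGES refinement):

  if (wbc.nr_to_write > 0) {      /* we ran out of dirty data ... */
          if (wbc.more_io)        /* ... but some inode has more io */
                  continue;       /* now honored for kupdate too */
          break;
  }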

CC: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-23 18:13:36.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-23 18:33:17.000000000 +0800
@@ -756,8 +756,8 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * If we ran out of stuff to write, bail unless more_io got set
 		 */
-		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
-			if (wbc.more_io && !wbc.for_kupdate) {
+		if (wbc.nr_to_write > 0) {
+			if (wbc.more_io) {
 				if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 					continue;
 				/*




* [PATCH 4/6] writeback: cleanup writeback_single_inode()
  2009-09-23 12:33 [PATCH 0/6] [RFC] writeback fixes for 2.6.32 Wu Fengguang
                   ` (2 preceding siblings ...)
  2009-09-23 12:33 ` [PATCH 3/6] writeback: kupdate writeback shall not stop when more io is possible Wu Fengguang
@ 2009-09-23 12:33 ` Wu Fengguang
  2009-09-23 12:33 ` [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier Wu Fengguang
  2009-09-23 12:33 ` [PATCH 6/6] writeback: redirty a fully scanned inode Wu Fengguang
  5 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 12:33 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Michael Rubin, Peter Zijlstra, Theodore Ts'o,
	Dave Chinner, Chris Mason, Christoph Hellwig, Peter Zijlstra,
	linux-fsdevel, Wu Fengguang, LKML

[-- Attachment #1: writeback-simplify-x.patch --]
[-- Type: text/plain, Size: 1445 bytes --]

Straighten the if-else chain in writeback_single_inode().
No behavior change.
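
After the change the ladder reads top-down, most urgent case first
(sketch only, bodies elided; see the diff below):

  if (inode->i_state & I_DIRTY)
          redirty_tail(inode);    /* redirtied while under writeback */
  else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
          ;                       /* pages left over: requeue or redirty */
  else if (atomic_read(&inode->i_count))
          ;                       /* clean, still in use */
  else
          ;                       /* clean and unused */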

Cc: Jan Kara <jack@suse.cz>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c |   15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-23 18:33:17.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-23 18:44:10.000000000 +0800
@@ -452,8 +452,13 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
-		if (!(inode->i_state & I_DIRTY) &&
-		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		if (inode->i_state & I_DIRTY) {
+			/*
+			 * Someone redirtied the inode while we were writing
+			 * back the pages.
+			 */
+			redirty_tail(inode);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
@@ -497,12 +502,6 @@ writeback_single_inode(struct inode *ino
 				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * Someone redirtied the inode while we were writing
-			 * back the pages.
-			 */
-			redirty_tail(inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse




* [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier
  2009-09-23 12:33 [PATCH 0/6] [RFC] writeback fixes for 2.6.32 Wu Fengguang
                   ` (3 preceding siblings ...)
  2009-09-23 12:33 ` [PATCH 4/6] writeback: cleanup writeback_single_inode() Wu Fengguang
@ 2009-09-23 12:33 ` Wu Fengguang
  2009-09-23 13:20   ` Wu Fengguang
  2009-09-26 19:47   ` Christoph Hellwig
  2009-09-23 12:33 ` [PATCH 6/6] writeback: redirty a fully scanned inode Wu Fengguang
  5 siblings, 2 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 12:33 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Theodore Tso, Dave Chinner, Chris Mason,
	Christoph Hellwig, Wu Fengguang, Peter Zijlstra, linux-fsdevel,
	LKML

[-- Attachment #1: writeback-redirty-page.patch --]
[-- Type: text/plain, Size: 1893 bytes --]

Debug traces show that in per-bdi writeback, the inode under writeback
almost always gets redirtied by a busy dirtier.  We used to call
redirty_tail() in this case, which could delay the inode for up to 30s.

This is unacceptable because it now happens so frequently for plain
cp/dd that the accumulated delays could make writeback of big files
very slow.

So let's distinguish between data redirty and metadata-only redirty.
The former is caused by a busy dirtier, while the latter can happen in
XFS, NFS, etc. when they are doing delalloc or updating isize.

An inode that is being busily dirtied will now be requeued for the next
io, while an inode redirtied only by the fs will continue to be delayed,
to avoid repeated IO.
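
In sketch form (note that I_DIRTY includes I_DIRTY_PAGES, so the pages
test must come first):

  if (inode->i_state & I_DIRTY_PAGES)     /* data redirtied */
          goto select_queue;              /* requeue for next io */
  else if (inode->i_state & I_DIRTY)      /* metadata-only redirty */
          redirty_tail(inode);            /* delay to avoid repeated IO */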

CC: Jan Kara <jack@suse.cz>
CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-23 16:13:41.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-23 16:21:24.000000000 +0800
@@ -491,10 +493,15 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
-		if (inode->i_state & I_DIRTY) {
+		if (inode->i_state & I_DIRTY_PAGES) {
 			/*
-			 * Someone redirtied the inode while we were writing
-			 * back the pages.
+			 * More pages get dirtied by a fast dirtier.
+			 */
+			goto select_queue;
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * At least XFS will redirty the inode during the
+			 * writeback (delalloc) and on io completion (isize).
 			 */
 			redirty_tail(inode);
 		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {




* [PATCH 6/6] writeback: redirty a fully scanned inode
  2009-09-23 12:33 [PATCH 0/6] [RFC] writeback fixes for 2.6.32 Wu Fengguang
                   ` (4 preceding siblings ...)
  2009-09-23 12:33 ` [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier Wu Fengguang
@ 2009-09-23 12:33 ` Wu Fengguang
  5 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 12:33 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Theodore Tso, Dave Chinner, Chris Mason,
	Christoph Hellwig, Wu Fengguang, Peter Zijlstra, linux-fsdevel,
	LKML

[-- Attachment #1: writeback-delay-rewrites.patch --]
[-- Type: text/plain, Size: 1438 bytes --]

Redirty the inode when its writeback_index has wrapped around.
This is mainly to avoid unnecessary IO for fast dirtiers
that do lots of overwrites.
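
A simplified sketch of the idea: remember where do_writepages() started,
and treat a writeback_index that did not advance as a full scan of the
inode:

  start_index = mapping->writeback_index;
  ret = do_writepages(mapping, wbc);
  ...
  if (wbc->nr_to_write <= 0 &&
      start_index < mapping->writeback_index)
          requeue_io(inode);      /* slice used up, still progressing */
  else
          redirty_tail(inode);    /* index wrapped: fully scanned, delay */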

CC: Jan Kara <jack@suse.cz>
CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- linux.orig/fs/fs-writeback.c	2009-09-23 18:44:56.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-23 18:52:22.000000000 +0800
@@ -397,6 +397,7 @@ writeback_single_inode(struct inode *ino
 {
 	struct address_space *mapping = inode->i_mapping;
 	int wait = wbc->sync_mode == WB_SYNC_ALL;
+	pgoff_t start_index;
 	unsigned dirty;
 	int ret;
 
@@ -434,6 +435,7 @@ writeback_single_inode(struct inode *ino
 
 	spin_unlock(&inode_lock);
 
+	start_index = mapping->writeback_index;
 	ret = do_writepages(mapping, wbc);
 
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
@@ -485,7 +487,9 @@ writeback_single_inode(struct inode *ino
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-				if (wbc->nr_to_write <= 0) {
+select_queue:
+				if (wbc->nr_to_write <= 0 &&
+				    start_index < mapping->writeback_index) {
 					/*
 					 * slice used up: queue for next turn
 					 */




* Re: [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-23 12:33 ` [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
@ 2009-09-23 12:45   ` Christoph Hellwig
  2009-09-23 12:53     ` Wu Fengguang
  2009-09-23 13:56   ` [PATCH 1/6 -v2] " Wu Fengguang
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2009-09-23 12:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Ts'o,
	Dave Chinner, Chris Mason, Christoph Hellwig, Peter Zijlstra,
	linux-fsdevel, LKML

On Wed, Sep 23, 2009 at 08:33:39PM +0800, Wu Fengguang wrote:
> Some filesystems may choose to write much more than ratelimit_pages
> before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> determine the number of pages to write based on the real number of
> dirtied pages.
> 
> The increased write_chunk may make the dirtier more bumpy.  It is the
> filesystem writers' duty not to dirty too many pages at a time without
> checking the ratelimit.

For some reason the filesystems tricking around
balance_dirty_pages_ratelimited's semantics seem to get much better
buffered write performance.  Are you sure this is not going to destroy
that benefit?  (But yes, we should make sure it behaves the same for all
filesystems)

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> --- linux.orig/mm/page-writeback.c	2009-09-23 16:31:46.000000000 +0800
> +++ linux/mm/page-writeback.c	2009-09-23 16:33:19.000000000 +0800
> @@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
>  /*
>   * When balance_dirty_pages decides that the caller needs to perform some
>   * non-background writeback, this is how many pages it will attempt to write.
> - * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
> + * It should be somewhat larger than dirtied pages to ensure that reasonably
>   * large amounts of I/O are submitted.
>   */
> -static inline long sync_writeback_pages(void)
> +static inline long sync_writeback_pages(unsigned long dirtied)
>  {
> -	return ratelimit_pages + ratelimit_pages / 2;
> +	return dirtied + dirtied / 2;
>  }
>  
>  /* The following parameters are exported via /proc/sys/vm */
> @@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
>   * If we're over `background_thresh' then pdflush is woken to perform some
>   * writeout.
>   */
> -static void balance_dirty_pages(struct address_space *mapping)
> +static void balance_dirty_pages(struct address_space *mapping,
> +				unsigned long write_chunk)
>  {
>  	long nr_reclaimable, bdi_nr_reclaimable;
>  	long nr_writeback, bdi_nr_writeback;
> @@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
>  	unsigned long pages_written = 0;
> -	unsigned long write_chunk = sync_writeback_pages();
>  	unsigned long pause = 1;
>  
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> @@ -640,9 +640,10 @@ void balance_dirty_pages_ratelimited_nr(
>  	p =  &__get_cpu_var(bdp_ratelimits);
>  	*p += nr_pages_dirtied;
>  	if (unlikely(*p >= ratelimit)) {
> +		ratelimit = sync_writeback_pages(*p);
>  		*p = 0;
>  		preempt_enable();
> -		balance_dirty_pages(mapping);
> +		balance_dirty_pages(mapping, ratelimit);
>  		return;
>  	}
>  	preempt_enable();
> 
> 
---end quoted text---


* Re: [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-23 12:45   ` Christoph Hellwig
@ 2009-09-23 12:53     ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 12:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Ts'o,
	Dave Chinner, Chris Mason, Peter Zijlstra, linux-fsdevel, LKML

On Wed, Sep 23, 2009 at 08:45:29PM +0800, Christoph Hellwig wrote:
> On Wed, Sep 23, 2009 at 08:33:39PM +0800, Wu Fengguang wrote:
> > Some filesystems may choose to write much more than ratelimit_pages
> > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > determine the number of pages to write based on the real number of
> > dirtied pages.
> > 
> > The increased write_chunk may make the dirtier more bumpy.  It is the
> > filesystem writers' duty not to dirty too many pages at a time without
> > checking the ratelimit.
> 
> For some reason the filesystems tricking around
> balance_dirty_pages_ratelimited's semantics seem to get much better
> buffered write performance.

Yes, I would expect enlarged ratelimit numbers to benefit performance.

> Are you sure this is not going to destroy that benefit?  (But yes,
> we should make sure it behaves the same for all filesystems)

If a filesystem chooses to use a larger ratelimit number, then this
patch will also enlarge the write chunk size (which is now computed
from the enlarged ratelimit number), which I believe will be
beneficial, too.
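
For example, assuming a filesystem whose writes reach
balance_dirty_pages_ratelimited_nr() in 1024-page batches:

  sync_writeback_pages(1024) = 1024 + 1024/2 = 1536 pages

per chunk, instead of a chunk derived from the smaller default
ratelimit_pages.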

Thanks,
Fengguang

> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |   13 +++++++------
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > --- linux.orig/mm/page-writeback.c	2009-09-23 16:31:46.000000000 +0800
> > +++ linux/mm/page-writeback.c	2009-09-23 16:33:19.000000000 +0800
> > @@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
> >  /*
> >   * When balance_dirty_pages decides that the caller needs to perform some
> >   * non-background writeback, this is how many pages it will attempt to write.
> > - * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
> > + * It should be somewhat larger than dirtied pages to ensure that reasonably
> >   * large amounts of I/O are submitted.
> >   */
> > -static inline long sync_writeback_pages(void)
> > +static inline long sync_writeback_pages(unsigned long dirtied)
> >  {
> > -	return ratelimit_pages + ratelimit_pages / 2;
> > +	return dirtied + dirtied / 2;
> >  }
> >  
> >  /* The following parameters are exported via /proc/sys/vm */
> > @@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
> >   * If we're over `background_thresh' then pdflush is woken to perform some
> >   * writeout.
> >   */
> > -static void balance_dirty_pages(struct address_space *mapping)
> > +static void balance_dirty_pages(struct address_space *mapping,
> > +				unsigned long write_chunk)
> >  {
> >  	long nr_reclaimable, bdi_nr_reclaimable;
> >  	long nr_writeback, bdi_nr_writeback;
> > @@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
> >  	unsigned long dirty_thresh;
> >  	unsigned long bdi_thresh;
> >  	unsigned long pages_written = 0;
> > -	unsigned long write_chunk = sync_writeback_pages();
> >  	unsigned long pause = 1;
> >  
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > @@ -640,9 +640,10 @@ void balance_dirty_pages_ratelimited_nr(
> >  	p =  &__get_cpu_var(bdp_ratelimits);
> >  	*p += nr_pages_dirtied;
> >  	if (unlikely(*p >= ratelimit)) {
> > +		ratelimit = sync_writeback_pages(*p);
> >  		*p = 0;
> >  		preempt_enable();
> > -		balance_dirty_pages(mapping);
> > +		balance_dirty_pages(mapping, ratelimit);
> >  		return;
> >  	}
> >  	preempt_enable();
> > 
> > 
> ---end quoted text---


* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier
  2009-09-23 12:33 ` [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier Wu Fengguang
@ 2009-09-23 13:20   ` Wu Fengguang
  2009-09-23 13:23     ` Christoph Hellwig
  2009-09-26 19:47   ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 13:20 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Theodore Tso, Dave Chinner, Chris Mason,
	Christoph Hellwig, Peter Zijlstra, linux-fsdevel, LKML

On Wed, Sep 23, 2009 at 08:33:43PM +0800, Wu, Fengguang wrote:
> Debug traces show that in per-bdi writeback, the inode under writeback
> almost always gets redirtied by a busy dirtier.  We used to call
> redirty_tail() in this case, which could delay the inode for up to 30s.
> 
> This is unacceptable because it now happens so frequently for plain
> cp/dd that the accumulated delays could make writeback of big files
> very slow.
> 
> So let's distinguish between data redirty and metadata-only redirty.
> The former is caused by a busy dirtier, while the latter can happen in
> XFS, NFS, etc. when they are doing delalloc or updating isize.
> 
> An inode that is being busily dirtied will now be requeued for the next
> io, while an inode redirtied only by the fs will continue to be delayed,
> to avoid repeated IO.

Here are some test results on XFS.  The workload is a simple

         cp /dev/zero /mnt/xfs/

I saw many repeated 

[  344.043711] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3520 n=12
[  344.043711] global dirty=57051 writeback=12920 nfs=0 flags=CM towrite=0 skipped=0
[  344.043711] redirty_tail() +516: inode=128  => the directory inode
[  344.043711] redirty_tail() +516: inode=131  => the zero file being written to

and then repeated 

[  347.408629] fs/fs-writeback.c +813 wb_writeback(): comm=flush-8:0 pid=3298 n=4096
[  347.411045] global dirty=50397 writeback=18065 nfs=0 flags=CM towrite=0 skipped=0
[  347.422077] requeue_io() +468: inode=131
[  347.423809] redirty_tail() +516: inode=128

traces during copy, and repeated 


[  373.326496] redirty_tail() +516: inode=131
[  373.328209] fs/fs-writeback.c +813 wb_writeback(): comm=flush-8:0 pid=3298 n=4096
[  373.330988] global dirty=25213 writeback=20470 nfs=0 flags=CM towrite=0 skipped=0

after the copy is interrupted.

I noticed that
- the write chunk size of balance_dirty_pages() is 12, which is pretty
  small and inefficient.
- during copy, the inode is sometimes redirty_tail (old behavior) and
  sometimes requeue_io (new behavior).
- during copy, the directory inode will always be synced and then
  redirty_tail.
- after copy, the inode will be redirtied after sync.

It shall not be a problem to use requeue_io for XFS, because whether
it be requeue_io or redirty_tail, write_inode() will be called once
for every 4MB.

It would be inefficient if XFS really tried to write the inode's and
the directory inode's metadata every time it synced 4MB of pages. If
that write attempt were turned into _real_ IO, that would be bad
and kill performance. Increasing MAX_WRITEBACK_PAGES may help
reduce the frequency of write_inode() though.

Thanks,
Fengguang

> CC: Jan Kara <jack@suse.cz>
> CC: Theodore Ts'o <tytso@mit.edu>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Jens Axboe <jens.axboe@oracle.com>
> CC: Chris Mason <chris.mason@oracle.com>
> CC: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c	2009-09-23 16:13:41.000000000 +0800
> +++ linux/fs/fs-writeback.c	2009-09-23 16:21:24.000000000 +0800
> @@ -491,10 +493,15 @@ writeback_single_inode(struct inode *ino
>  	spin_lock(&inode_lock);
>  	inode->i_state &= ~I_SYNC;
>  	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> -		if (inode->i_state & I_DIRTY) {
> +		if (inode->i_state & I_DIRTY_PAGES) {
>  			/*
> -			 * Someone redirtied the inode while we were writing
> -			 * back the pages.
> +			 * More pages get dirtied by a fast dirtier.
> +			 */
> +			goto select_queue;
> +		} else if (inode->i_state & I_DIRTY) {
> +			/*
> +			 * At least XFS will redirty the inode during the
> +			 * writeback (delalloc) and on io completion (isize).
>  			 */
>  			redirty_tail(inode);
>  		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> 


* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier
  2009-09-23 13:20   ` Wu Fengguang
@ 2009-09-23 13:23     ` Christoph Hellwig
  2009-09-23 13:40       ` Wu Fengguang
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2009-09-23 13:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Tso, Dave Chinner,
	Chris Mason, Christoph Hellwig, Peter Zijlstra, linux-fsdevel,
	LKML

On Wed, Sep 23, 2009 at 09:20:08PM +0800, Wu Fengguang wrote:
> I noticed that
> - the write chunk size of balance_dirty_pages() is 12, which is pretty
>   small and inefficient.
> - during copy, the inode is sometimes redirty_tail (old behavior) and
>   sometimes requeue_io (new behavior).
> - during copy, the directory inode will always be synced and then
>   redirty_tail.
> - after copy, the inode will be redirtied after sync.

Yeah, XFS uses generic_file_buffered_write and the heuristics in there
for balance_dirty_pages turned out to be really bad.  So far we didn't
manage to successfully get that fixed, though.

> It shall not be a problem to use requeue_io for XFS, because whether
> it be requeue_io or redirty_tail, write_inode() will be called once
> for every 4MB.
> 
> It would be inefficient if XFS really tried to write the inode's and
> the directory inode's metadata every time it synced 4MB of pages. If
> that write attempt were turned into _real_ IO, that would be bad
> and kill performance. Increasing MAX_WRITEBACK_PAGES may help
> reduce the frequency of write_inode() though.

The way we call write_inode is extremely inefficient for XFS.  As
you noticed XFS tends to redirty the inode on I/O completion, and we
also cluster inode writeouts.  For XFS we'd really prefer not to
intermix data and inode writeout, but first do the data writeout and
then later push out the inodes, preferably with as many inodes as
possible swept out in one go.



* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier
  2009-09-23 13:23     ` Christoph Hellwig
@ 2009-09-23 13:40       ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 13:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Tso, Dave Chinner,
	Chris Mason, Peter Zijlstra, linux-fsdevel, LKML

On Wed, Sep 23, 2009 at 09:23:51PM +0800, Christoph Hellwig wrote:
> On Wed, Sep 23, 2009 at 09:20:08PM +0800, Wu Fengguang wrote:
> > I noticed that
> > - the write chunk size of balance_dirty_pages() is 12, which is pretty
> >   small and inefficient.
> > - during copy, the inode is sometimes redirty_tail (old behavior) and
> >   sometimes requeue_io (new behavior).
> > - during copy, the directory inode will always be synced and then
> >   redirty_tail.
> > - after copy, the inode will be redirtied after sync.
> 
> Yeah, XFS uses generic_file_buffered_write and the heuristics in there
> for balance_dirty_pages turned out to be really bad.  So far we didn't
> manage to successfully get that fixed, though.

Ah sorry, it's because of the first patch: it does not always "bump up"
the write chunk. In your case it is obviously decreased (the original
ratelimit_pages=4096 is a much larger value).  I'll fix it.

> > It shall not be a problem to use requeue_io for XFS, because whether
> > it be requeue_io or redirty_tail, write_inode() will be called once
> > for every 4MB.
> > 
> > It would be inefficient if XFS really tried to write the inode's and
> > the directory inode's metadata every time it synced 4MB of pages. If
> > that write attempt were turned into _real_ IO, that would be bad
> > and kill performance. Increasing MAX_WRITEBACK_PAGES may help
> > reduce the frequency of write_inode() though.
> 
> The way we call write_inode is extremely inefficient for XFS.  As
> you noticed XFS tends to redirty the inode on I/O completion, and we
> also cluster inode writeouts.  For XFS we'd really prefer not to
> intermix data and inode writeout, but first do the data writeout and
> then later push out the inodes, preferably with as many inodes as
> possible swept out in one go.

I guess the difficult part would be the possible policy requirements
on the max batch size (the max number of inodes or pages to write
before switching to metadata) and on the delay time (between the sync
of data and metadata). It may take a long time to make a full scan of
the dirty list.

Thanks,
Fengguang


* [PATCH 1/6 -v2] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-23 12:33 ` [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
  2009-09-23 12:45   ` Christoph Hellwig
@ 2009-09-23 13:56   ` Wu Fengguang
  2009-09-23 13:58     ` Wu Fengguang
  1 sibling, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 13:56 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Theodore Ts'o, Dave Chinner, Chris Mason,
	Christoph Hellwig, Peter Zijlstra, linux-fsdevel, LKML

Some filesystems may choose to write much more than ratelimit_pages
before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
determine the number of pages to write based on the real number of
dirtied pages.

Otherwise it is possible that
  loop {
    btrfs_file_write():     dirty 1024 pages
    balance_dirty_pages():  write up to 48 pages (= ratelimit_pages * 1.5)
  }
in which the writeback rate cannot keep up with the dirty rate, and the
dirty pages go all the way beyond dirty_thresh.

The increased write_chunk may make the dirtier more bumpy.
So filesystems shall take care not to dirty too much at
a time (eg. > 4MB) without checking the ratelimit.
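
For example, with the clamp below, and assuming ratelimit_pages has
been scaled up to 1024 (it is computed from memory size and CPU count):

  sync_writeback_pages(12)   = max(12, 1024) * 3/2 = 1536 pages
  sync_writeback_pages(4096) = 4096 + 4096/2       = 6144 pages

so a caller that dirtied only a few pages still gets a reasonably
large chunk.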

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-09-23 21:41:21.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-23 21:42:17.000000000 +0800
@@ -44,12 +44,15 @@ static long ratelimit_pages = 32;
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
+ * It should be somewhat larger than dirtied pages to ensure that reasonably
  * large amounts of I/O are submitted.
  */
-static inline long sync_writeback_pages(void)
+static inline long sync_writeback_pages(unsigned long dirtied)
 {
-	return ratelimit_pages + ratelimit_pages / 2;
+	if (dirtied < ratelimit_pages)
+		dirtied = ratelimit_pages;
+
+	return dirtied + dirtied / 2;
 }
 
 /* The following parameters are exported via /proc/sys/vm */
@@ -476,7 +479,8 @@ get_dirty_limits(unsigned long *pbackgro
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping)
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
@@ -484,7 +488,6 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
-	unsigned long write_chunk = sync_writeback_pages();
 	unsigned long pause = 1;
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -639,9 +642,10 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
+		ratelimit = sync_writeback_pages(*p);
 		*p = 0;
 		preempt_enable();
-		balance_dirty_pages(mapping);
+		balance_dirty_pages(mapping, ratelimit);
 		return;
 	}
 	preempt_enable();


* Re: [PATCH 1/6 -v2] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-23 13:56   ` [PATCH 1/6 -v2] " Wu Fengguang
@ 2009-09-23 13:58     ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-23 13:58 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Jan Kara, Theodore Ts'o, Dave Chinner, Chris Mason,
	Christoph Hellwig, Peter Zijlstra, linux-fsdevel, LKML

On Wed, Sep 23, 2009 at 09:56:00PM +0800, Wu Fengguang wrote:
> -static inline long sync_writeback_pages(void)
> +static inline long sync_writeback_pages(unsigned long dirtied)
>  {
> -	return ratelimit_pages + ratelimit_pages / 2;
> +	if (dirtied < ratelimit_pages)
> +		dirtied = ratelimit_pages;

Just added the above check. Now balance_dirty_pages() for XFS
works with a much larger chunk size of 1536 pages :)

[   40.374081] redirty_tail() +516: inode=131
[   40.374951] mm/page-writeback.c +543 balance_dirty_pages(): comm=cp pid=3296 n=1536
[   40.377001] global dirty=49818 writeback=17027 nfs=0 flags=CM towrite=0 skipped=0

> +
> +	return dirtied + dirtied / 2;
>  }

Thanks,
Fengguang



* Re: [PATCH 2/6] writeback: stop background writeback when below background threshold
  2009-09-23 12:33 ` [PATCH 2/6] writeback: stop background writeback when below background threshold Wu Fengguang
@ 2009-09-23 15:05   ` Jens Axboe
  2009-09-24  1:24     ` Wu Fengguang
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2009-09-23 15:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Theodore Ts'o,
	Dave Chinner, Chris Mason, Christoph Hellwig, linux-fsdevel,
	LKML

On Wed, Sep 23 2009, Wu Fengguang wrote:
> Treat bdi_start_writeback(0) as a special request to do background
> writeback, and stop such work when we are below the background dirty
> threshold.
> 
> Also simplify the (nr_pages <= 0) checks. Since we already pass in
> nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't
> need to worry about it being decreased to zero.

Good cleanup! Do you want me to collect these fixes, or how shall we
handle this?

-- 
Jens Axboe



* Re: [PATCH 2/6] writeback: stop background writeback when below background threshold
  2009-09-23 15:05   ` Jens Axboe
@ 2009-09-24  1:24     ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-24  1:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Theodore Ts'o,
	Dave Chinner, Chris Mason, Christoph Hellwig, linux-fsdevel,
	LKML

On Wed, Sep 23, 2009 at 11:05:10PM +0800, Jens Axboe wrote:
> On Wed, Sep 23 2009, Wu Fengguang wrote:
> > Treat bdi_start_writeback(0) as a special request to do background
> > writeback, and stop such work when we are below the background dirty
> > threshold.
> > 
> > Also simplify the (nr_pages <= 0) checks. Since we already pass in
> > nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't
> > need to worry about it being decreased to zero.
> 
> Good cleanup! Do you want me to collect these fixes, or how shall we
> handle this?

Thanks. I'm feeling OK with these patches (except the last one) as well
as Jan's busy loop fix. Can you pull them into your tree?

Thanks,
Fengguang


* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier
  2009-09-23 12:33 ` [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier Wu Fengguang
  2009-09-23 13:20   ` Wu Fengguang
@ 2009-09-26 19:47   ` Christoph Hellwig
  2009-09-27  2:02     ` Wu Fengguang
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2009-09-26 19:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Tso, Dave Chinner,
	Chris Mason, Christoph Hellwig, Peter Zijlstra, linux-fsdevel,
	LKML

On Wed, Sep 23, 2009 at 08:33:43PM +0800, Wu Fengguang wrote:
> So let's distinguish between data redirty and metadata-only redirty.
> The former is caused by a busy dirtier, while the latter can happen in
> XFS, NFS, etc. when they are doing delalloc or updating isize.

Btw, I'm not sure the existing and preserved behaviour for that case
is good.  In the worst case the inode writeout gets delayed by another
30 seconds, doubling the window until a file is on disk if it was
extended.



* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier
  2009-09-26 19:47   ` Christoph Hellwig
@ 2009-09-27  2:02     ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2009-09-27  2:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Tso, Dave Chinner,
	Chris Mason, Peter Zijlstra, linux-fsdevel, LKML

On Sun, Sep 27, 2009 at 03:47:47AM +0800, Christoph Hellwig wrote:
> On Wed, Sep 23, 2009 at 08:33:43PM +0800, Wu Fengguang wrote:
> > So let's distinguish between data redirty and metadata-only redirty.
> > The former is caused by a busy dirtier, while the latter can happen in
> > XFS, NFS, etc. when they are doing delalloc or updating isize.
> 
> Btw, I'm not sure the existing and preserved behaviour for that case
> is good.  In the worst case the inode writeout gets delayed by another
> 30 seconds, doubling the window until a file is on disk if it was
> extended.

Yes, the preserved behaviour is not optimal for XFS, but it is safe.

We could try redirty_tail only when there are no remaining
dirty pages and only metadata dirtiness remains. Like this:

---
 fs/fs-writeback.c |   20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-27 09:52:15.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-27 09:54:23.000000000 +0800
@@ -477,18 +477,7 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
-		if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
-			/*
-			 * More pages get dirtied by a fast dirtier.
-			 */
-			goto select_queue;
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * At least XFS will redirty the inode during the
-			 * writeback (delalloc) and on io completion (isize).
-			 */
-			redirty_tail(inode);
-		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
@@ -510,7 +499,6 @@ writeback_single_inode(struct inode *ino
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-select_queue:
 				if (wbc->nr_to_write <= 0) {
 					/*
 					 * slice used up: queue for next turn
@@ -533,6 +521,12 @@ select_queue:
 				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * At least XFS will redirty the inode during the
+			 * writeback (delalloc) and on io completion (isize).
+			 */
+			redirty_tail(inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse

