* [RFC][PATCH 0/7] some random writeback fixes
@ 2009-09-09 14:51 Wu Fengguang
  2009-09-09 14:51 ` [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode() Wu Fengguang
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:51 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Dave Chinner, Chris Mason, Peter Zijlstra, Christoph Hellwig,
	jack, Artem Bityutskiy, Wu Fengguang, LKML, linux-fsdevel

Hi all,

These are some writeback fixes based on the latest linux-next, without
the vm.max_writeback_mb knob, since I'm not aware of any strong reasons
for introducing it. Instead, MAX_WRITEBACK_PAGES is directly increased
to 64MB.

The other patches are straightforward. They need more testing, but this
is enough to show the basic ideas. Comments are welcome!

Thanks,
Fengguang
-- 



* [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode()
  2009-09-09 14:51 [RFC][PATCH 0/7] some random writeback fixes Wu Fengguang
@ 2009-09-09 14:51 ` Wu Fengguang
  2009-09-09 15:45   ` Jan Kara
  2009-09-09 14:51 ` [RFC][PATCH 2/7] writeback: fix queue_io() ordering Wu Fengguang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:51 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Dave Chinner, Chris Mason, Peter Zijlstra, Christoph Hellwig,
	jack, Artem Bityutskiy, Wu Fengguang, LKML, linux-fsdevel

[-- Attachment #1: writeback-simplify-x.patch --]
[-- Type: text/plain, Size: 1421 bytes --]

Make the if-else straight in writeback_single_inode().
No behavior change.

Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c |   15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-09 21:40:41.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
@@ -417,8 +417,13 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
-		if (!(inode->i_state & I_DIRTY) &&
-		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		if (inode->i_state & I_DIRTY) {
+			/*
+			 * Someone redirtied the inode while we were writing back
+			 * the pages.
+			 */
+			redirty_tail(inode);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
@@ -462,12 +467,6 @@ writeback_single_inode(struct inode *ino
 				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * Someone redirtied the inode while were writing back
-			 * the pages.
-			 */
-			redirty_tail(inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse

-- 



* [RFC][PATCH 2/7] writeback: fix queue_io() ordering
  2009-09-09 14:51 [RFC][PATCH 0/7] some random writeback fixes Wu Fengguang
  2009-09-09 14:51 ` [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode() Wu Fengguang
@ 2009-09-09 14:51 ` Wu Fengguang
  2009-09-09 15:53   ` Jan Kara
  2009-09-09 14:51 ` [RFC][PATCH 3/7] writeback: merge for_kupdate and !for_kupdate requeue io logics Wu Fengguang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:51 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Dave Chinner, Chris Mason, Peter Zijlstra, Christoph Hellwig,
	jack, Artem Bityutskiy, Wu Fengguang, LKML, linux-fsdevel

[-- Attachment #1: queue_io-fix.patch --]
[-- Type: text/plain, Size: 1168 bytes --]

This was not a bug, since b_io is empty for kupdate writeback.
However, the next patch will do requeue_io() for non-kupdate writeback
as well, so fix the ordering now.

CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
@@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
 }
 
 /*
- * Queue all expired dirty inodes for io, eldest first.
+ * Queue all expired dirty inodes for io, eldest first:
+ * (newly dirtied) => b_dirty inodes
+ *                 => b_more_io inodes
+ *                 => remaining inodes in b_io => (dequeue for sync)
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	list_splice_init(&wb->b_more_io, wb->b_io.prev);
+	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
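The splice-target change is easy to see with a userspace sketch (hypothetical toy code; the list helpers below imitate include/linux/list.h but are not the kernel's). Since writeback takes inodes from the tail of b_io, entries spliced at the head are serviced after the remaining b_io inodes:

```c
#include <assert.h>
#include <string.h>

/* Minimal model of the kernel's circular doubly-linked list. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Splice @list between @head and @head->next, then reinitialize @list. */
static void list_splice_init(struct list_head *list, struct list_head *head)
{
	if (list->next != list) {
		struct list_head *first = list->next, *last = list->prev;
		first->prev = head;
		last->next = head->next;
		head->next->prev = last;
		head->next = first;
		INIT_LIST_HEAD(list);
	}
}

struct item { struct list_head node; char name; };

/* Build b_io = [A, B] and b_more_io = [C, D], splice as queue_io()
 * would, and record the resulting b_io order (head to tail) in @buf. */
static void queue_io_order(int at_head, char buf[5])
{
	struct item a = { .name = 'A' }, b = { .name = 'B' };
	struct item c = { .name = 'C' }, d = { .name = 'D' };
	struct list_head io, more_io, *p;

	INIT_LIST_HEAD(&io);
	INIT_LIST_HEAD(&more_io);
	list_add_tail(&a.node, &io);
	list_add_tail(&b.node, &io);
	list_add_tail(&c.node, &more_io);
	list_add_tail(&d.node, &more_io);

	/* old code: splice before the b_io tail; new code: at the head */
	list_splice_init(&more_io, at_head ? &io : io.prev);

	for (p = io.next; p != &io; p = p->next)
		*buf++ = ((struct item *)p)->name; /* node is first member */
	*buf = '\0';
}
```

With the old target (wb->b_io.prev) the order comes out "ABCD"; with the new target (&wb->b_io) it is "CDAB", so the requeued b_more_io inodes sit at the head and are dequeued only after the remaining b_io inodes.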
 

-- 



* [RFC][PATCH 3/7] writeback: merge for_kupdate and !for_kupdate requeue io logics
  2009-09-09 14:51 [RFC][PATCH 0/7] some random writeback fixes Wu Fengguang
  2009-09-09 14:51 ` [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode() Wu Fengguang
  2009-09-09 14:51 ` [RFC][PATCH 2/7] writeback: fix queue_io() ordering Wu Fengguang
@ 2009-09-09 14:51 ` Wu Fengguang
  2009-09-09 14:51 ` [RFC][PATCH 4/7] writeback: ensure large files are written in MAX_WRITEBACK_PAGES chunks Wu Fengguang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:51 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Dave Chinner, Chris Mason, Peter Zijlstra, Christoph Hellwig,
	jack, Artem Bityutskiy, Wu Fengguang, LKML, linux-fsdevel

[-- Attachment #1: writeback-more_io_wait-a.patch --]
[-- Type: text/plain, Size: 2511 bytes --]

Unify the logic for the kupdate and non-kupdate cases.
There won't be starvation, because the inodes requeued into b_more_io or
b_more_io_wait will later be spliced _after_ the remaining inodes in b_io,
and hence won't stand in the way of other inodes in the next run.

CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 fs/fs-writeback.c |   39 ++++++---------------------------------
 1 file changed, 6 insertions(+), 33 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-09 20:47:11.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-09 20:48:01.000000000 +0800
@@ -426,45 +426,18 @@ writeback_single_inode(struct inode *ino
 		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything. Redirty
-			 * the inode; Move it from b_io onto b_more_io/b_dirty.
+			 * sometimes bales out without doing anything.
 			 */
-			/*
-			 * akpm: if the caller was the kupdate function we put
-			 * this inode at the head of b_dirty so it gets first
-			 * consideration.  Otherwise, move it to the tail, for
-			 * the reasons described there.  I'm not really sure
-			 * how much sense this makes.  Presumably I had a good
-			 * reasons for doing it this way, and I'd rather not
-			 * muck with it at present.
-			 */
-			if (wbc->for_kupdate) {
+			inode->i_state |= I_DIRTY_PAGES;
+			if (wbc->nr_to_write <= 0) {
 				/*
-				 * For the kupdate function we move the inode
-				 * to b_more_io so it will get more writeout as
-				 * soon as the queue becomes uncongested.
+				 * slice used up: queue for next turn
 				 */
-				inode->i_state |= I_DIRTY_PAGES;
-				if (wbc->nr_to_write <= 0) {
-					/*
-					 * slice used up: queue for next turn
-					 */
-					requeue_io(inode);
-				} else {
-					/*
-					 * somehow blocked: retry later
-					 */
-					redirty_tail(inode);
-				}
+				requeue_io(inode);
 			} else {
 				/*
-				 * Otherwise fully redirty the inode so that
-				 * other inodes on this superblock will get some
-				 * writeout.  Otherwise heavy writing to one
-				 * file would indefinitely suspend writeout of
-				 * all the other files.
+				 * somehow blocked: retry later
 				 */
-				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
 		} else if (atomic_read(&inode->i_count)) {

-- 



* [RFC][PATCH 4/7] writeback: ensure large files are written in MAX_WRITEBACK_PAGES chunks
  2009-09-09 14:51 [RFC][PATCH 0/7] some random writeback fixes Wu Fengguang
                   ` (2 preceding siblings ...)
  2009-09-09 14:51 ` [RFC][PATCH 3/7] writeback: merge for_kupdate and !for_kupdate requeue io logics Wu Fengguang
@ 2009-09-09 14:51 ` Wu Fengguang
  2009-09-09 14:51 ` [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES Wu Fengguang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:51 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Dave Chinner, Chris Mason, Peter Zijlstra, Christoph Hellwig,
	jack, Artem Bityutskiy, Wu Fengguang, LKML, linux-fsdevel

[-- Attachment #1: writeback-track-large-file.patch --]
[-- Type: text/plain, Size: 4903 bytes --]

Remember the pages written for the current file between successive
writeback_single_inode() invocations, and adjust wbc->nr_to_write
accordingly, so that writing continues until MAX_WRITEBACK_PAGES is
reached for this single file.

This ensures large files will be written in large MAX_WRITEBACK_PAGES
chunks. It works best for the kernel sync threads, which repeatedly call
into writeback_single_inode() with the same wbc. balance_dirty_pages()
normally restarts with a fresh wbc, so it may never accumulate enough
last_file_written to skip the current large file, which could starve
other (small) files. Luckily, balance_dirty_pages() writeback is
normally interleaved with background writeback, which does the duty of
rotating the writeback files. So this is not a big problem.

CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   41 +++++++++++++++++++++++++++---------
 include/linux/writeback.h |   12 ++++++++++
 2 files changed, 43 insertions(+), 10 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-09 21:50:53.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-09 21:51:04.000000000 +0800
@@ -271,6 +271,19 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_list, &wb->b_more_io);
 }
 
+/*
+ * continue io on this inode on next writeback if
+ * it has not accumulated large enough writeback io chunk
+ */
+static void requeue_partial_io(struct writeback_control *wbc, struct inode *inode)
+{
+	if (wbc->last_file_written == 0 ||
+	    wbc->last_file_written >= MAX_WRITEBACK_PAGES)
+		return requeue_io(inode);
+
+	list_move_tail(&inode->i_list, &inode_to_bdi(inode)->wb.b_io);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -365,6 +378,8 @@ writeback_single_inode(struct inode *ino
 {
 	struct address_space *mapping = inode->i_mapping;
 	int wait = wbc->sync_mode == WB_SYNC_ALL;
+	long last_file_written;
+	long nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -402,8 +417,21 @@ writeback_single_inode(struct inode *ino
 
 	spin_unlock(&inode_lock);
 
+	if (wbc->last_file != inode->i_ino)
+		last_file_written = 0;
+	else
+		last_file_written = wbc->last_file_written;
+	wbc->nr_to_write -= last_file_written;
+	nr_to_write = wbc->nr_to_write;
+
 	ret = do_writepages(mapping, wbc);
 
+	if (wbc->last_file != inode->i_ino) {
+		wbc->last_file = inode->i_ino;
+		wbc->last_file_written = nr_to_write - wbc->nr_to_write;
+	} else
+		wbc->last_file_written += nr_to_write - wbc->nr_to_write;
+
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wait);
@@ -436,7 +464,7 @@ writeback_single_inode(struct inode *ino
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(inode);
+				requeue_partial_io(wbc, inode);
 			} else {
 				/*
 				 * somehow blocked: retry later
@@ -456,6 +484,8 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+	wbc->nr_to_write += last_file_written;
+
 	return ret;
 }
 
@@ -612,15 +642,6 @@ void writeback_inodes_wbc(struct writeba
 	writeback_inodes_wb(&bdi->wb, wbc);
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
--- linux.orig/include/linux/writeback.h	2009-09-09 21:50:53.000000000 +0800
+++ linux/include/linux/writeback.h	2009-09-09 21:51:22.000000000 +0800
@@ -14,6 +14,16 @@ extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024
+
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
@@ -36,6 +46,8 @@ struct writeback_control {
 					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
+	unsigned long last_file;	/* Inode number of last written file */
+	long last_file_written;		/* Total pages written for last file */
 	long pages_skipped;		/* Pages which were not written */
 
 	/*
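The bookkeeping this patch adds around do_writepages() can be modeled in userspace (a sketch with hypothetical names, wbc_model and writeback_one; do_writepages() is replaced by a min() of dirty pages against the remaining budget). The key moves are: charge last_file_written against nr_to_write on entry, and restore the charge on exit:

```c
#include <assert.h>

#define MAX_WRITEBACK_PAGES 16384	/* 64MB of 4KB pages, per the next patch */

/* Hypothetical cut-down writeback_control: just the two fields this
 * patch adds, plus nr_to_write. */
struct wbc_model {
	unsigned long last_file;	/* inode number of last written file */
	long last_file_written;		/* pages written for that file so far */
	long nr_to_write;
};

/* Model one writeback_single_inode() call: charge the pages written
 * for this inode in earlier rounds against nr_to_write, "write" up to
 * the remaining budget, then undo the charge on the way out. */
static long writeback_one(struct wbc_model *wbc, unsigned long ino,
			  long dirty_pages)
{
	long last = (wbc->last_file == ino) ? wbc->last_file_written : 0;
	long budget, written;

	wbc->nr_to_write -= last;	/* earlier rounds used this much */
	budget = wbc->nr_to_write;

	written = dirty_pages < budget ? dirty_pages : budget; /* do_writepages() */
	wbc->nr_to_write -= written;

	wbc->last_file = ino;
	wbc->last_file_written = last + written;
	wbc->nr_to_write += last;	/* undo the charge */
	return written;
}
```

With these numbers, requeue_partial_io() keeps the inode on b_io while 0 < last_file_written < MAX_WRITEBACK_PAGES (e.g. after a congestion-shortened round) and falls back to plain requeue_io() once the full chunk has accumulated.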

-- 



* [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
  2009-09-09 14:51 [RFC][PATCH 0/7] some random writeback fixes Wu Fengguang
                   ` (3 preceding siblings ...)
  2009-09-09 14:51 ` [RFC][PATCH 4/7] writeback: ensure large files are written in MAX_WRITEBACK_PAGES chunks Wu Fengguang
@ 2009-09-09 14:51 ` Wu Fengguang
  2009-09-09 23:29   ` Theodore Tso
  2009-09-09 14:51 ` [RFC][PATCH 6/7] writeback: dont abort inode on congestion Wu Fengguang
  2009-09-09 14:51 ` [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
  6 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:51 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Dave Chinner, Chris Mason, Peter Zijlstra, Christoph Hellwig,
	jack, Artem Bityutskiy, Wu Fengguang, LKML, linux-fsdevel

[-- Attachment #1: writeback-64M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 2407 bytes --]

Theodore has a nice description:

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.

According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped out nr_to_write to four times the value
sent by the VM to be able to saturate medium-sized RAID arrays.  This
value was also problematic for ext4 as well, as it caused large files
to be come interleaved on disk by in 8 megabyte chunks (we bumped up
the nr_to_write by a factor of two).

So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
max_writeback_mb, and set it to a default value of 128 megabytes.

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |   13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

--- linux.orig/include/linux/writeback.h	2009-09-09 21:51:54.000000000 +0800
+++ linux/include/linux/writeback.h	2009-09-09 21:51:58.000000000 +0800
@@ -14,14 +14,13 @@ extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
+ * The maximum number of pages to writeout in a single periodic/background
+ * writeback operation. 64MB means I_SYNC may be held for up to 1 second.
+ * This is not a big problem since we normally do kind of trylock on I_SYNC
+ * for non-data-integrity writes.  Userspace tasks doing throttled writeback
+ * do not use this value.
  */
-#define MAX_WRITEBACK_PAGES     1024
-
+#define MAX_WRITEBACK_PAGES	(64 * 1024 * 1024 / PAGE_SIZE)
 
 /*
  * fs/fs-writeback.c

-- 



* [RFC][PATCH 6/7] writeback: dont abort inode on congestion
  2009-09-09 14:51 [RFC][PATCH 0/7] some random writeback fixes Wu Fengguang
                   ` (4 preceding siblings ...)
  2009-09-09 14:51 ` [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES Wu Fengguang
@ 2009-09-09 14:51 ` Wu Fengguang
  2009-09-09 14:51 ` [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
  6 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:51 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Dave Chinner, Chris Mason, Peter Zijlstra, Christoph Hellwig,
	jack, Artem Bityutskiy, Wu Fengguang, LKML, linux-fsdevel

[-- Attachment #1: writeback-requeue-congestion.patch --]
[-- Type: text/plain, Size: 1041 bytes --]

Retry writeback of the current inode when congestion is encountered,
instead of redirtying it, which may
- slow down the writeback pace
- lead to small writeback IO

Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Chad Talbott <ctalbott@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-09 21:55:19.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-09 22:04:23.000000000 +0800
@@ -460,9 +460,11 @@ writeback_single_inode(struct inode *ino
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
+			if (wbc->encountered_congestion ||
+			    wbc->nr_to_write <= 0) {
 				/*
-				 * slice used up: queue for next turn
+				 * if slice used up, queue for next round;
+				 * otherwise continue this inode after return
 				 */
 				requeue_partial_io(wbc, inode);
 			} else {

-- 



* [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-09 14:51 [RFC][PATCH 0/7] some random writeback fixes Wu Fengguang
                   ` (5 preceding siblings ...)
  2009-09-09 14:51 ` [RFC][PATCH 6/7] writeback: dont abort inode on congestion Wu Fengguang
@ 2009-09-09 14:51 ` Wu Fengguang
  2009-09-09 15:44   ` Jan Kara
  6 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:51 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe
  Cc: Dave Chinner, Chris Mason, Peter Zijlstra, Christoph Hellwig,
	jack, Artem Bityutskiy, Wu Fengguang, LKML, linux-fsdevel

[-- Attachment #1: writeback-ratelimit.patch --]
[-- Type: text/plain, Size: 2350 bytes --]

Some filesystems may choose to write much more than ratelimit_pages
before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
determine the number of pages to write based on the real number of
dirtied pages.

The increased write_chunk may make the dirtier more bumpy.  It is the
filesystem writers' duty not to dirty too much at a time without
checking the ratelimit.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-09-09 21:19:21.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-09 21:25:45.000000000 +0800
@@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
+ * It should be somewhat larger than dirtied pages to ensure that reasonably
  * large amounts of I/O are submitted.
  */
-static inline long sync_writeback_pages(void)
+static inline long sync_writeback_pages(unsigned long dirtied)
 {
-	return ratelimit_pages + ratelimit_pages / 2;
+	return dirtied + dirtied / 2;
 }
 
 /* The following parameters are exported via /proc/sys/vm */
@@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping)
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
@@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
-	unsigned long write_chunk = sync_writeback_pages();
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
@@ -638,9 +638,10 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
+		ratelimit = sync_writeback_pages(*p);
 		*p = 0;
 		preempt_enable();
-		balance_dirty_pages(mapping);
+		balance_dirty_pages(mapping, ratelimit);
 		return;
 	}
 	preempt_enable();
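The effect of the change can be checked standalone (a copy of the reworked helper with illustrative numbers; the old behaviour used the fixed ratelimit_pages = 32 from the top of this file):

```c
#include <assert.h>

/* Standalone copy of the reworked sync_writeback_pages(): write back
 * 1.5x the pages the caller actually dirtied, rather than 1.5x the
 * fixed ratelimit. */
static long sync_writeback_pages(unsigned long dirtied)
{
	return dirtied + dirtied / 2;
}
```

For the common case (dirtied == ratelimit_pages == 32) the write_chunk stays at 48, the old fixed value; a filesystem that dirtied 4096 pages before calling balance_dirty_pages_ratelimited_nr() now writes back a proportional 6144 instead.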

-- 



* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-09 14:51 ` [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
@ 2009-09-09 15:44   ` Jan Kara
  2009-09-10  1:42     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2009-09-09 15:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> Some filesystems may choose to write much more than ratelimit_pages
> before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> determine the number of pages to write based on the real number of
> dirtied pages.
> 
> The increased write_chunk may make the dirtier more bumpy.  It is the
> filesystem writers' duty not to dirty too much at a time without
> checking the ratelimit.
  I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
dirty the page, not when we write it out. So a problem would only happen if
a filesystem dirties pages via set_page_dirty() and doesn't call
balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
and do_wp_page() take care of that. So where's the problem?

								Honza
> ---
>  mm/page-writeback.c |   13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> --- linux.orig/mm/page-writeback.c	2009-09-09 21:19:21.000000000 +0800
> +++ linux/mm/page-writeback.c	2009-09-09 21:25:45.000000000 +0800
> @@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
>  /*
>   * When balance_dirty_pages decides that the caller needs to perform some
>   * non-background writeback, this is how many pages it will attempt to write.
> - * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
> + * It should be somewhat larger than dirtied pages to ensure that reasonably
>   * large amounts of I/O are submitted.
>   */
> -static inline long sync_writeback_pages(void)
> +static inline long sync_writeback_pages(unsigned long dirtied)
>  {
> -	return ratelimit_pages + ratelimit_pages / 2;
> +	return dirtied + dirtied / 2;
>  }
>  
>  /* The following parameters are exported via /proc/sys/vm */
> @@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
>   * If we're over `background_thresh' then pdflush is woken to perform some
>   * writeout.
>   */
> -static void balance_dirty_pages(struct address_space *mapping)
> +static void balance_dirty_pages(struct address_space *mapping,
> +				unsigned long write_chunk)
>  {
>  	long nr_reclaimable, bdi_nr_reclaimable;
>  	long nr_writeback, bdi_nr_writeback;
> @@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
>  	unsigned long pages_written = 0;
> -	unsigned long write_chunk = sync_writeback_pages();
>  
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
> @@ -638,9 +638,10 @@ void balance_dirty_pages_ratelimited_nr(
>  	p =  &__get_cpu_var(bdp_ratelimits);
>  	*p += nr_pages_dirtied;
>  	if (unlikely(*p >= ratelimit)) {
> +		ratelimit = sync_writeback_pages(*p);
>  		*p = 0;
>  		preempt_enable();
> -		balance_dirty_pages(mapping);
> +		balance_dirty_pages(mapping, ratelimit);
>  		return;
>  	}
>  	preempt_enable();
> 
> -- 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode()
  2009-09-09 14:51 ` [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode() Wu Fengguang
@ 2009-09-09 15:45   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2009-09-09 15:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed 09-09-09 22:51:42, Wu Fengguang wrote:
> Make the if-else straight in writeback_single_inode().
> No behavior change.
> 
> Cc: Michael Rubin <mrubin@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
  The patch looks good.
Acked-by: Jan Kara <jack@suse.cz>

> ---
>  fs/fs-writeback.c |   15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c	2009-09-09 21:40:41.000000000 +0800
> +++ linux/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> @@ -417,8 +417,13 @@ writeback_single_inode(struct inode *ino
>  	spin_lock(&inode_lock);
>  	inode->i_state &= ~I_SYNC;
>  	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> -		if (!(inode->i_state & I_DIRTY) &&
> -		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +		if (inode->i_state & I_DIRTY) {
> +			/*
> +			 * Someone redirtied the inode while we were writing back
> +			 * the pages.
> +			 */
> +			redirty_tail(inode);
> +		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
>  			 * We didn't write back all the pages.  nfs_writepages()
>  			 * sometimes bales out without doing anything. Redirty
> @@ -462,12 +467,6 @@ writeback_single_inode(struct inode *ino
>  				inode->i_state |= I_DIRTY_PAGES;
>  				redirty_tail(inode);
>  			}
> -		} else if (inode->i_state & I_DIRTY) {
> -			/*
> -			 * Someone redirtied the inode while were writing back
> -			 * the pages.
> -			 */
> -			redirty_tail(inode);
>  		} else if (atomic_read(&inode->i_count)) {
>  			/*
>  			 * The inode is clean, inuse
> 
> -- 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH 2/7] writeback: fix queue_io() ordering
  2009-09-09 14:51 ` [RFC][PATCH 2/7] writeback: fix queue_io() ordering Wu Fengguang
@ 2009-09-09 15:53   ` Jan Kara
  2009-09-10  1:26     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2009-09-09 15:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed 09-09-09 22:51:43, Wu Fengguang wrote:
> This was not a bug, since b_io is empty for kupdate writeback.
> However, the next patch will do requeue_io() for non-kupdate writeback
> as well, so fix the ordering now.
  But doesn't this patch also need your "anti-starvation" patch?
Looking into the code, we put an inode on b_more_io when nr_to_write
drops to zero, and this way we'd just start writing it again
in the next round...

							Honza
> 
> CC: Dave Chinner <david@fromorbit.com>
> Cc: Martin Bligh <mbligh@google.com>
> Cc: Michael Rubin <mrubin@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> ---
>  fs/fs-writeback.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> +++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
> @@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
>  }
>  
>  /*
> - * Queue all expired dirty inodes for io, eldest first.
> + * Queue all expired dirty inodes for io, eldest first:
> + * (newly dirtied) => b_dirty inodes
> + *                 => b_more_io inodes
> + *                 => remaining inodes in b_io => (dequeue for sync)
>   */
>  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
>  {
> -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> +	list_splice_init(&wb->b_more_io, &wb->b_io);
>  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
>  }
>  
> 
> -- 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
  2009-09-09 14:51 ` [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES Wu Fengguang
@ 2009-09-09 23:29   ` Theodore Tso
  2009-09-10  0:13       ` Wu Fengguang
  2009-09-10  4:53     ` Peter Zijlstra
  0 siblings, 2 replies; 29+ messages in thread
From: Theodore Tso @ 2009-09-09 23:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed, Sep 09, 2009 at 10:51:46PM +0800, Wu Fengguang wrote:
> + * The maximum number of pages to writeout in a single periodic/background
> > + * writeback operation. 64MB means I_SYNC may be held for up to 1 second.
> + * This is not a big problem since we normally do kind of trylock on I_SYNC
> + * for non-data-integrity writes.  Userspace tasks doing throttled writeback
> + * do not use this value.

What's your justification for using 64MB?  Where are you getting 1
second from?  On a fast RAID array 64MB can be written in much less
than 1 second.

More generally, I assume your patches conflict with Jens' per-bdi patches?

     		  	      	      	       - Ted


* Re: [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
  2009-09-09 23:29   ` Theodore Tso
@ 2009-09-10  0:13       ` Wu Fengguang
  2009-09-10  4:53     ` Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10  0:13 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Jens Axboe, Dave Chinner,
	Chris Mason, Peter Zijlstra, Christoph Hellwig, jack,
	Artem Bityutskiy, LKML, linux-fsdevel

On Thu, Sep 10, 2009 at 07:29:38AM +0800, Theodore Tso wrote:
> On Wed, Sep 09, 2009 at 10:51:46PM +0800, Wu Fengguang wrote:
> > + * The maximum number of pages to writeout in a single periodic/background
> > + * writeback operation. 64MB means I_SYNC may be hold for up to 1 second.
> > + * This is not a big problem since we normally do kind of trylock on I_SYNC
> > + * for non-data-integrity writes.  Userspace tasks doing throttled writeback
> > + * do not use this value.
> 
> What's your justification for using 64MB?  Where are you getting 1
> second from?  On a fast RAID array 64MB can be written in much less
> than 1 second.

Because this value will be used for desktops and servers alike, we
have to consider the worst case - for a laptop, it takes about 1
second to write 64MB. It's not strictly accurate to say I_SYNC may be
held for up to 1 second; rather, the same I_SYNC will be taken,
dropped because of congestion, and retaken, and this loop could go on
for 1 second until 64MB of a large file has been written. Meanwhile,
in the normal case of a single kupdate thread per bdi, that file
blocks writeout of other files for 1 second.

> More generally, I assume your patches conflict with Jens' per-bdi patches?

It is based on the latest linux-next tree with the vm.max_writeback_mb
patch reverted. The changelog only states the need to increase the
size, but does not mention why it should be made tunable. So I tend to
just bump up MAX_WRITEBACK_PAGES. It seems that a large value won't
hurt anyone. Or is there demand to increase it much further for some
server configurations?

Thanks,
Fengguang

* Re: [RFC][PATCH 2/7] writeback: fix queue_io() ordering
  2009-09-09 15:53   ` Jan Kara
@ 2009-09-10  1:26     ` Wu Fengguang
  2009-09-10 14:14       ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10  1:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed, Sep 09, 2009 at 11:53:30PM +0800, Jan Kara wrote:
> On Wed 09-09-09 22:51:43, Wu Fengguang wrote:
> > This was not a bug, since b_io is empty for kupdate writeback.
> > The next patch will do requeue_io() for non-kupdate writeback,
> > so let's fix it.
>   But doesn't this patch also need your "anti-starvation" patch?

Honza, can you show me that patch?

> Looking into the code, we put inode to b_more_io when nr_to_write
> drops to zero and this way we'd just start writing it again
> in the next round...

I'm confused. It's OK to start it in the next round. Starvation can
only occur if we start it immediately in the next writeback_inodes()
invocation. How can that happen with this patch?

Thanks,
Fengguang

> 							Honza
> > 
> > CC: Dave Chinner <david@fromorbit.com>
> > Cc: Martin Bligh <mbligh@google.com>
> > Cc: Michael Rubin <mrubin@google.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > ---
> >  fs/fs-writeback.c |    7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > --- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> > +++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
> > @@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
> >  }
> >  
> >  /*
> > - * Queue all expired dirty inodes for io, eldest first.
> > + * Queue all expired dirty inodes for io, eldest first:
> > + * (newly dirtied) => b_dirty inodes
> > + *                 => b_more_io inodes
> > + *                 => remaining inodes in b_io => (dequeue for sync)
> >   */
> >  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> >  {
> > -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> > +	list_splice_init(&wb->b_more_io, &wb->b_io);
> >  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> >  }
> >  
> > 
> > -- 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-09 15:44   ` Jan Kara
@ 2009-09-10  1:42     ` Wu Fengguang
  2009-09-10 12:57       ` Chris Mason
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10  1:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > Some filesystem may choose to write much more than ratelimit_pages
> > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > determine number to write based on real number of dirtied pages.
> > 
> > The increased write_chunk may make the dirtier more bumpy.  This is
> > filesystem writers' duty not to dirty too much at a time without
> > checking the ratelimit.
>   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> dirty the page, not when we write it out. So a problem would only happen if
> filesystem dirties pages by set_page_dirty() and won't call
> balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> and do_wp_page() takes care of that. So where's the problem?

It seems that btrfs_file_write() writes in chunks of up to 1024 pages
(1024 is the computed nrptrs value on a 32-bit kernel), and it calls
balance_dirty_pages_ratelimited_nr() each time it dirties such a chunk.

Thanks,
Fengguang

> > ---
> >  mm/page-writeback.c |   13 +++++++------
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > --- linux.orig/mm/page-writeback.c	2009-09-09 21:19:21.000000000 +0800
> > +++ linux/mm/page-writeback.c	2009-09-09 21:25:45.000000000 +0800
> > @@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
> >  /*
> >   * When balance_dirty_pages decides that the caller needs to perform some
> >   * non-background writeback, this is how many pages it will attempt to write.
> > - * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
> > + * It should be somewhat larger than dirtied pages to ensure that reasonably
> >   * large amounts of I/O are submitted.
> >   */
> > -static inline long sync_writeback_pages(void)
> > +static inline long sync_writeback_pages(unsigned long dirtied)
> >  {
> > -	return ratelimit_pages + ratelimit_pages / 2;
> > +	return dirtied + dirtied / 2;
> >  }
> >  
> >  /* The following parameters are exported via /proc/sys/vm */
> > @@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
> >   * If we're over `background_thresh' then pdflush is woken to perform some
> >   * writeout.
> >   */
> > -static void balance_dirty_pages(struct address_space *mapping)
> > +static void balance_dirty_pages(struct address_space *mapping,
> > +				unsigned long write_chunk)
> >  {
> >  	long nr_reclaimable, bdi_nr_reclaimable;
> >  	long nr_writeback, bdi_nr_writeback;
> > @@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
> >  	unsigned long dirty_thresh;
> >  	unsigned long bdi_thresh;
> >  	unsigned long pages_written = 0;
> > -	unsigned long write_chunk = sync_writeback_pages();
> >  
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  
> > @@ -638,9 +638,10 @@ void balance_dirty_pages_ratelimited_nr(
> >  	p =  &__get_cpu_var(bdp_ratelimits);
> >  	*p += nr_pages_dirtied;
> >  	if (unlikely(*p >= ratelimit)) {
> > +		ratelimit = sync_writeback_pages(*p);
> >  		*p = 0;
> >  		preempt_enable();
> > -		balance_dirty_pages(mapping);
> > +		balance_dirty_pages(mapping, ratelimit);
> >  		return;
> >  	}
> >  	preempt_enable();
> > 
> > -- 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

* Re: [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
  2009-09-09 23:29   ` Theodore Tso
  2009-09-10  0:13       ` Wu Fengguang
@ 2009-09-10  4:53     ` Peter Zijlstra
  2009-09-10  7:35       ` Wu Fengguang
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2009-09-10  4:53 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Wu Fengguang, Andrew Morton, Jens Axboe, Dave Chinner,
	Chris Mason, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed, 2009-09-09 at 19:29 -0400, Theodore Tso wrote:
> On Wed, Sep 09, 2009 at 10:51:46PM +0800, Wu Fengguang wrote:
> > + * The maximum number of pages to writeout in a single periodic/background
> > + * writeback operation. 64MB means I_SYNC may be hold for up to 1 second.
> > + * This is not a big problem since we normally do kind of trylock on I_SYNC
> > + * for non-data-integrity writes.  Userspace tasks doing throttled writeback
> > + * do not use this value.
> 
> What's your justification for using 64MB?  Where are you getting 1
> second from?  On a fast RAID array 64MB can be written in much less
> than 1 second.

Worse, on my 5MB/s USB stick, writing out 64MB will take forever.


* Re: [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
  2009-09-10  4:53     ` Peter Zijlstra
@ 2009-09-10  7:35       ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10  7:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Theodore Tso, Andrew Morton, Jens Axboe, Dave Chinner,
	Chris Mason, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Thu, Sep 10, 2009 at 12:53:18PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 19:29 -0400, Theodore Tso wrote:
> > On Wed, Sep 09, 2009 at 10:51:46PM +0800, Wu Fengguang wrote:
> > > + * The maximum number of pages to writeout in a single periodic/background
> > > + * writeback operation. 64MB means I_SYNC may be hold for up to 1 second.
> > > + * This is not a big problem since we normally do kind of trylock on I_SYNC
> > > + * for non-data-integrity writes.  Userspace tasks doing throttled writeback
> > > + * do not use this value.
> > 
> > What's your justification for using 64MB?  Where are you getting 1
> > second from?  On a fast RAID array 64MB can be written in much less
> > than 1 second.
> 
> Worse, on my 5mb/s usb stick writing out 64m will take forever.

        cp bigfile1 bigfile2 /mnt/usb/
        sync

In that case the user would notice that the kernel keeps writing to one
file for up to 13 seconds before switching to another file.

A simple fix would look like this. It stops I/O continuation on one
file after 1 second. It works because, when the device is congested,
we rely on the I/O continuation logic (based on last_file*) to retry
the same file until MAX_WRITEBACK_PAGES is reached. The number of
queueable requests between the congested <=> uncongested states is
not very large. For slow devices, the number of queueable pages
between the empty <=> congested states is also not very large. For
example, my USB stick has nr_requests=128 and max_sectors_kb=120. It
would take less than 12MB to congest this queue.

With this patch and my USB stick, the kernel may first sync 12MB of
bigfile1 (which takes 1-3 seconds), then sync bigfile2 for 1 second,
then bigfile1 for 1 second, and so on.

It seems that we could now safely bump MAX_WRITEBACK_PAGES to even
larger values beyond 128MB :)

Thanks,
Fengguang
---

--- linux.orig/fs/fs-writeback.c	2009-09-10 15:02:48.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-10 15:07:23.000000000 +0800
@@ -277,7 +277,8 @@ static void requeue_io(struct inode *ino
  */
 static void requeue_partial_io(struct writeback_control *wbc, struct inode *inode)
 {
-	if (wbc->last_file_written == 0 ||
+	if (time_before(wbc->last_file_time + HZ, jiffies) ||
+	    wbc->last_file_written == 0 ||
 	    wbc->last_file_written >= MAX_WRITEBACK_PAGES)
 		return requeue_io(inode);
 
@@ -428,6 +429,7 @@ writeback_single_inode(struct inode *ino
 
 	if (wbc->last_file != inode->i_ino) {
 		wbc->last_file = inode->i_ino;
+		wbc->last_file_time = jiffies;
 		wbc->last_file_written = nr_to_write - wbc->nr_to_write;
 	} else
 		wbc->last_file_written += nr_to_write - wbc->nr_to_write;
--- linux.orig/include/linux/writeback.h	2009-09-10 15:07:28.000000000 +0800
+++ linux/include/linux/writeback.h	2009-09-10 15:08:46.000000000 +0800
@@ -46,6 +46,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	unsigned long last_file;	/* Inode number of last written file */
+	unsigned long last_file_time;	/* First sync time for last file */
 	long last_file_written;		/* Total pages written for last file */
 	long pages_skipped;		/* Pages which were not written */
 

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10  1:42     ` Wu Fengguang
@ 2009-09-10 12:57       ` Chris Mason
  2009-09-10 13:21           ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Mason @ 2009-09-10 12:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Peter Zijlstra, Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel

On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > Some filesystem may choose to write much more than ratelimit_pages
> > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > determine number to write based on real number of dirtied pages.
> > > 
> > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > filesystem writers' duty not to dirty too much at a time without
> > > checking the ratelimit.
> >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > dirty the page, not when we write it out. So a problem would only happen if
> > filesystem dirties pages by set_page_dirty() and won't call
> > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > and do_wp_page() takes care of that. So where's the problem?
> 
> It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.

I can easily change this to call more often, but we do always call
balance_dirty_pages to reflect how much RAM we've really sent down.

-chris


* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 12:57       ` Chris Mason
@ 2009-09-10 13:21           ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10 13:21 UTC (permalink / raw)
  To: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Peter Zijlstra, Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel

On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > Some filesystem may choose to write much more than ratelimit_pages
> > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > determine number to write based on real number of dirtied pages.
> > > > 
> > > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > > filesystem writers' duty not to dirty too much at a time without
> > > > checking the ratelimit.
> > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > dirty the page, not when we write it out. So a problem would only happen if
> > > filesystem dirties pages by set_page_dirty() and won't call
> > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > and do_wp_page() takes care of that. So where's the problem?
> > 
> > It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> > (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> > balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.
> 
> I can easily change this to call more often, but we do always call
> balance_dirty_pages to reflect how much ram we've really sent down.

Btrfs is doing OK. 2MB/4MB are reasonable chunk sizes. The part that
needs changing is balance_dirty_pages_ratelimited_nr(), hence this
patch :)

Thanks,
Fengguang

* Re: [RFC][PATCH 2/7] writeback: fix queue_io() ordering
  2009-09-10  1:26     ` Wu Fengguang
@ 2009-09-10 14:14       ` Jan Kara
  2009-09-10 14:17         ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2009-09-10 14:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel

On Thu 10-09-09 09:26:24, Wu Fengguang wrote:
> On Wed, Sep 09, 2009 at 11:53:30PM +0800, Jan Kara wrote:
> > On Wed 09-09-09 22:51:43, Wu Fengguang wrote:
> > > This was not a bug, since b_io is empty for kupdate writeback.
> > > The next patch will do requeue_io() for non-kupdate writeback,
> > > so let's fix it.
> >   But doesn't this patch also need your "anti-starvation" patch?
> 
> Honza, can you show me that patch?
> 
> > Looking into the code, we put inode to b_more_io when nr_to_write
> > drops to zero and this way we'd just start writing it again
> > in the next round...
> 
> I'm confused. It's OK to start it in next round. Starvation can
> occur if we start it immediately in the next writeback_inodes()
> invocation. How can that happen with this patch?
  Sorry, my fault. For kupdate, we splice the list only once b_io is
empty, so that's not an issue. So the patch looks good.
  Acked-by: Jan Kara <jack@suse.cz>

> > > CC: Dave Chinner <david@fromorbit.com>
> > > Cc: Martin Bligh <mbligh@google.com>
> > > Cc: Michael Rubin <mrubin@google.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > > ---
> > >  fs/fs-writeback.c |    7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > --- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> > > +++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
> > > @@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
> > >  }
> > >  
> > >  /*
> > > - * Queue all expired dirty inodes for io, eldest first.
> > > + * Queue all expired dirty inodes for io, eldest first:
> > > + * (newly dirtied) => b_dirty inodes
> > > + *                 => b_more_io inodes
> > > + *                 => remaining inodes in b_io => (dequeue for sync)
> > >   */
> > >  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> > >  {
> > > -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> > > +	list_splice_init(&wb->b_more_io, &wb->b_io);
> > >  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> > >  }
> > >  
> > > 
> > > -- 
> > > 
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [RFC][PATCH 2/7] writeback: fix queue_io() ordering
  2009-09-10 14:14       ` Jan Kara
@ 2009-09-10 14:17         ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10 14:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel

On Thu, Sep 10, 2009 at 10:14:15PM +0800, Jan Kara wrote:
> On Thu 10-09-09 09:26:24, Wu Fengguang wrote:
> > On Wed, Sep 09, 2009 at 11:53:30PM +0800, Jan Kara wrote:
> > > On Wed 09-09-09 22:51:43, Wu Fengguang wrote:
> > > > This was not a bug, since b_io is empty for kupdate writeback.
> > > > The next patch will do requeue_io() for non-kupdate writeback,
> > > > so let's fix it.
> > >   But doesn't this patch also need your "anti-starvation" patch?
> > 
> > Honza, can you show me that patch?
> > 
> > > Looking into the code, we put inode to b_more_io when nr_to_write
> > > drops to zero and this way we'd just start writing it again
> > > in the next round...
> > 
> > I'm confused. It's OK to start it in next round. Starvation can
> > occur if we start it immediately in the next writeback_inodes()
> > invocation. How can that happen with this patch?
>   Sorry, my fault. For kupdate, we splice the list only once s_io is empty
> so that's not an issue. So the patch looks good.
>   Acked-by: Jan Kara <jack@suse.cz>

Thank you :)

Regards,
Fengguang

> > > > CC: Dave Chinner <david@fromorbit.com>
> > > > Cc: Martin Bligh <mbligh@google.com>
> > > > Cc: Michael Rubin <mrubin@google.com>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > > > ---
> > > >  fs/fs-writeback.c |    7 +++++--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > 
> > > > --- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> > > > +++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
> > > > @@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
> > > >  }
> > > >  
> > > >  /*
> > > > - * Queue all expired dirty inodes for io, eldest first.
> > > > + * Queue all expired dirty inodes for io, eldest first:
> > > > + * (newly dirtied) => b_dirty inodes
> > > > + *                 => b_more_io inodes
> > > > + *                 => remaining inodes in b_io => (dequeue for sync)
> > > >   */
> > > >  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> > > >  {
> > > > -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> > > > +	list_splice_init(&wb->b_more_io, &wb->b_io);
> > > >  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> > > >  }
> > > >  
> > > > 
> > > > -- 
> > > > 
> > > -- 
> > > Jan Kara <jack@suse.cz>
> > > SUSE Labs, CR
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 13:21           ` Wu Fengguang
  (?)
@ 2009-09-10 14:56           ` Peter Zijlstra
  2009-09-10 15:14             ` Wu Fengguang
  -1 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2009-09-10 14:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML, linux-fsdevel

On Thu, 2009-09-10 at 21:21 +0800, Wu Fengguang wrote:
> On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> > On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > > Some filesystem may choose to write much more than ratelimit_pages
> > > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > > determine number to write based on real number of dirtied pages.
> > > > > 
> > > > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > > > filesystem writers' duty not to dirty too much at a time without
> > > > > checking the ratelimit.
> > > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > > dirty the page, not when we write it out. So a problem would only happen if
> > > > filesystem dirties pages by set_page_dirty() and won't call
> > > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > > and do_wp_page() takes care of that. So where's the problem?
> > > 
> > > It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> > > (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> > > balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.
> > 
> > I can easily change this to call more often, but we do always call
> > balance_dirty_pages to reflect how much ram we've really sent down.
> 
> Btrfs is doing OK. 2MB/4MB looks like reasonable chunk sizes. The
> need-change part is balance_dirty_pages_ratelimited_nr(), hence this
> patch :)

I'm not getting it: it calls set_page_dirty() for each page, right? And
then it calls into balance_dirty_pages_ratelimited_nr(), which sounds
right. What is the problem with that?


* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 14:56           ` Peter Zijlstra
@ 2009-09-10 15:14             ` Wu Fengguang
  2009-09-10 15:31               ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML, linux-fsdevel

On Thu, Sep 10, 2009 at 10:56:04PM +0800, Peter Zijlstra wrote:
> On Thu, 2009-09-10 at 21:21 +0800, Wu Fengguang wrote:
> > On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> > > On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > > > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > > > Some filesystem may choose to write much more than ratelimit_pages
> > > > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > > > determine number to write based on real number of dirtied pages.
> > > > > > 
> > > > > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > > > > filesystem writers' duty not to dirty too much at a time without
> > > > > > checking the ratelimit.
> > > > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > > > dirty the page, not when we write it out. So a problem would only happen if
> > > > > filesystem dirties pages by set_page_dirty() and won't call
> > > > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > > > and do_wp_page() takes care of that. So where's the problem?
> > > > 
> > > > It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> > > > (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> > > > balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.
> > > 
> > > I can easily change this to call more often, but we do always call
> > > balance_dirty_pages to reflect how much ram we've really sent down.
> > 
> > Btrfs is doing OK. 2MB/4MB looks like reasonable chunk sizes. The
> > need-change part is balance_dirty_pages_ratelimited_nr(), hence this
> > patch :)
> 
> I'm not getting it, it calls set_page_dirty() for each page, right? and
> then it calls into balance_dirty_pages_ratelimited_nr(), that sounds
> right. What is the problem with that?

It looks like btrfs_file_write() eventually calls
__set_page_dirty_buffers() which in turn won't call
balance_dirty_pages*(). This is why do_wp_page() calls
set_page_dirty_balance() to do balance_dirty_pages*().

So btrfs_file_write() explicitly calls
balance_dirty_pages_ratelimited_nr() to get throttled.

Thanks,
Fengguang

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 15:14             ` Wu Fengguang
@ 2009-09-10 15:31               ` Peter Zijlstra
  2009-09-10 15:41                 ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2009-09-10 15:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML, linux-fsdevel

On Thu, 2009-09-10 at 23:14 +0800, Wu Fengguang wrote:
> On Thu, Sep 10, 2009 at 10:56:04PM +0800, Peter Zijlstra wrote:
> > On Thu, 2009-09-10 at 21:21 +0800, Wu Fengguang wrote:
> > > On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> > > > On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > > > > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > > > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > > > > Some filesystems may choose to write much more than ratelimit_pages
> > > > > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > > > > determine the number of pages to write based on the real number of
> > > > > > > dirtied pages.
> > > > > > > 
> > > > > > > The increased write_chunk may make the dirtier more bumpy.  It is the
> > > > > > > filesystem writers' duty not to dirty too many pages at a time without
> > > > > > > checking the ratelimit.
> > > > > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > > > > dirty the page, not when we write it out. So a problem would only happen if
> > > > > > a filesystem dirties pages by set_page_dirty() and doesn't call
> > > > > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > > > > and do_wp_page() take care of that. So where's the problem?
> > > > > 
> > > > > It seems that btrfs_file_write() is writing in chunks of up to 1024 pages
> > > > > (1024 is the computed nrptrs value on a 32-bit kernel). And it calls
> > > > > balance_dirty_pages_ratelimited_nr() each time it dirties such a chunk.
> > > > 
> > > > I can easily change this to call more often, but we do always call
> > > > balance_dirty_pages to reflect how much ram we've really sent down.
> > > 
> > > Btrfs is doing OK. 2MB/4MB look like reasonable chunk sizes. The part
> > > that needs changing is balance_dirty_pages_ratelimited_nr(), hence this
> > > patch :)
> > 
> > I'm not getting it. It calls set_page_dirty() for each page, right? And
> > then it calls into balance_dirty_pages_ratelimited_nr(); that sounds
> > right. What is the problem with that?
> 
> It looks like btrfs_file_write() eventually calls
> __set_page_dirty_buffers() which in turn won't call
> balance_dirty_pages*(). This is why do_wp_page() calls
> set_page_dirty_balance() to do balance_dirty_pages*().
> 
> So btrfs_file_write() explicitly calls
> balance_dirty_pages_ratelimited_nr() to get throttled.

Right, so what is wrong with that, and how does this patch fix it?

[ the only thing you have to be careful with is that you don't
excessively grow the error bound on the dirty limit ]



* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 15:31               ` Peter Zijlstra
@ 2009-09-10 15:41                 ` Wu Fengguang
  2009-09-10 15:54                   ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML, linux-fsdevel

On Thu, Sep 10, 2009 at 11:31:38PM +0800, Peter Zijlstra wrote:
> On Thu, 2009-09-10 at 23:14 +0800, Wu Fengguang wrote:
> > On Thu, Sep 10, 2009 at 10:56:04PM +0800, Peter Zijlstra wrote:
> > > On Thu, 2009-09-10 at 21:21 +0800, Wu Fengguang wrote:
> > > > On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> > > > > On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > > > > > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > > > > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > > > > > Some filesystems may choose to write much more than ratelimit_pages
> > > > > > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > > > > > determine the number of pages to write based on the real number of
> > > > > > > > dirtied pages.
> > > > > > > > 
> > > > > > > > The increased write_chunk may make the dirtier more bumpy.  It is the
> > > > > > > > filesystem writers' duty not to dirty too many pages at a time without
> > > > > > > > checking the ratelimit.
> > > > > > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > > > > > dirty the page, not when we write it out. So a problem would only happen if
> > > > > > > a filesystem dirties pages by set_page_dirty() and doesn't call
> > > > > > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > > > > > and do_wp_page() take care of that. So where's the problem?
> > > > > > 
> > > > > > It seems that btrfs_file_write() is writing in chunks of up to 1024 pages
> > > > > > (1024 is the computed nrptrs value on a 32-bit kernel). And it calls
> > > > > > balance_dirty_pages_ratelimited_nr() each time it dirties such a chunk.
> > > > > 
> > > > > I can easily change this to call more often, but we do always call
> > > > > balance_dirty_pages to reflect how much ram we've really sent down.
> > > > 
> > > > Btrfs is doing OK. 2MB/4MB look like reasonable chunk sizes. The part
> > > > that needs changing is balance_dirty_pages_ratelimited_nr(), hence this
> > > > patch :)
> > > 
> > > I'm not getting it. It calls set_page_dirty() for each page, right? And
> > > then it calls into balance_dirty_pages_ratelimited_nr(); that sounds
> > > right. What is the problem with that?
> > 
> > It looks like btrfs_file_write() eventually calls
> > __set_page_dirty_buffers() which in turn won't call
> > balance_dirty_pages*(). This is why do_wp_page() calls
> > set_page_dirty_balance() to do balance_dirty_pages*().
> > 
> > So btrfs_file_write() explicitly calls
> > balance_dirty_pages_ratelimited_nr() to get throttled.
> 
> Right, so what is wrong with that, and how does this patch fix it?
> 
> [ the only thing you have to be careful with is that you don't
> excessively grow the error bound on the dirty limit ]

Then we could form a loop:

        btrfs_file_write():     dirty 1024 pages
        balance_dirty_pages():  write up to 12 pages (= ratelimit_pages * 1.5)

in which the writeback rate cannot keep up with the dirty rate,
and the dirty pages go all the way beyond dirty_thresh.

Sorry for writing such a vague changelog!

Thanks,
Fengguang


* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 15:41                 ` Wu Fengguang
@ 2009-09-10 15:54                   ` Peter Zijlstra
  2009-09-10 16:08                     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2009-09-10 15:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML, linux-fsdevel

On Thu, 2009-09-10 at 23:41 +0800, Wu Fengguang wrote:
> > > So btrfs_file_write() explicitly calls
> > > balance_dirty_pages_ratelimited_nr() to get throttled.
> > 
> > Right, so what is wrong with that, and how does this patch fix it?
> > 
> > [ the only thing you have to be careful with is that you don't
> > excessively grow the error bound on the dirty limit ]
> 
> Then we could form a loop:
> 
>         btrfs_file_write():     dirty 1024 pages
>         balance_dirty_pages():  write up to 12 pages (= ratelimit_pages * 1.5)
> 
> in which the writeback rate cannot keep up with the dirty rate,
> and the dirty pages go all the way beyond dirty_thresh.

Ah, ok so this is to keep the error bound on the dirty limit bounded,
because we can break out of balance_dirty_pages() early, the /* We've
done our duty */ break.

Which unbalances the duty vs the dirty ratio.
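For readers without the source at hand, the loop in question is roughly
the following (a heavily simplified sketch from memory of
mm/page-writeback.c in this era, not the exact code):

```c
/* balance_dirty_pages(), simplified sketch only */
for (;;) {
	if (nr_reclaimable + nr_writeback <= dirty_thresh)
		break;			/* below the limit, nothing to do */

	writeback_inodes(&wbc);		/* try to clean up to write_chunk pages */
	pages_written += write_chunk - wbc.nr_to_write;

	if (pages_written >= write_chunk)
		break;			/* We've done our duty */

	congestion_wait(BLK_RW_ASYNC, HZ / 10);
}
```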

I figure that with the task dirty limit stuff we could maybe try to get
rid of this break.. worth a try.

> Sorry for writing such a vague changelog!

np, as long as we get there :-)

Change makes sense now, thanks!



* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 15:54                   ` Peter Zijlstra
@ 2009-09-10 16:08                     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2009-09-10 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML, linux-fsdevel

On Thu, Sep 10, 2009 at 11:54:29PM +0800, Peter Zijlstra wrote:
> On Thu, 2009-09-10 at 23:41 +0800, Wu Fengguang wrote:
> > > > So btrfs_file_write() explicitly calls
> > > > balance_dirty_pages_ratelimited_nr() to get throttled.
> > > 
> > > Right, so what is wrong with that, and how does this patch fix it?
> > > 
> > > [ the only thing you have to be careful with is that you don't
> > > excessively grow the error bound on the dirty limit ]
> > 
> > Then we could form a loop:
> > 
> >         btrfs_file_write():     dirty 1024 pages
> >         balance_dirty_pages():  write up to 12 pages (= ratelimit_pages * 1.5)
> > 
> > in which the writeback rate cannot keep up with the dirty rate,
> > and the dirty pages go all the way beyond dirty_thresh.
> 
> Ah, ok so this is to keep the error bound on the dirty limit bounded,
> because we can break out of balance_dirty_pages() early, the /* We've
> done our duty */ break.
> 
> Which unbalances the duty vs the dirty ratio.

Right!

> I figure that with the task dirty limit stuff we could maybe try to get
> rid of this break.. worth a try.

Be careful. Without that break, the time a task gets throttled in a
single trip may go out of control. For example, task B may get blocked
for 1000 seconds because task A keeps dirtying pages; meanwhile task
A's dirty thresh goes down slowly, but remains larger than B's.

> > Sorry for writing such a vague changelog!
> 
> np, as long as we get there :-)
> 
> Change makes sense now, thanks!

May I add your ack?

Thanks,
Fengguang


end of thread, other threads:[~2009-09-10 16:09 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-09 14:51 [RFC][PATCH 0/7] some random writeback fixes Wu Fengguang
2009-09-09 14:51 ` [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode() Wu Fengguang
2009-09-09 15:45   ` Jan Kara
2009-09-09 14:51 ` [RFC][PATCH 2/7] writeback: fix queue_io() ordering Wu Fengguang
2009-09-09 15:53   ` Jan Kara
2009-09-10  1:26     ` Wu Fengguang
2009-09-10 14:14       ` Jan Kara
2009-09-10 14:17         ` Wu Fengguang
2009-09-09 14:51 ` [RFC][PATCH 3/7] writeback: merge for_kupdate and !for_kupdate requeue io logics Wu Fengguang
2009-09-09 14:51 ` [RFC][PATCH 4/7] writeback: ensure large files are written in MAX_WRITEBACK_PAGES chunks Wu Fengguang
2009-09-09 14:51 ` [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES Wu Fengguang
2009-09-09 23:29   ` Theodore Tso
2009-09-10  0:13     ` Wu Fengguang
2009-09-10  0:13       ` Wu Fengguang
2009-09-10  4:53     ` Peter Zijlstra
2009-09-10  7:35       ` Wu Fengguang
2009-09-09 14:51 ` [RFC][PATCH 6/7] writeback: dont abort inode on congestion Wu Fengguang
2009-09-09 14:51 ` [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages Wu Fengguang
2009-09-09 15:44   ` Jan Kara
2009-09-10  1:42     ` Wu Fengguang
2009-09-10 12:57       ` Chris Mason
2009-09-10 13:21         ` Wu Fengguang
2009-09-10 13:21           ` Wu Fengguang
2009-09-10 14:56           ` Peter Zijlstra
2009-09-10 15:14             ` Wu Fengguang
2009-09-10 15:31               ` Peter Zijlstra
2009-09-10 15:41                 ` Wu Fengguang
2009-09-10 15:54                   ` Peter Zijlstra
2009-09-10 16:08                     ` Wu Fengguang
