* [PATCH 0/6] writeback: moving expire targets for background/kupdate works
From: Wu Fengguang @ 2011-04-19  3:00 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Dave Chinner, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, Wu Fengguang, LKML, linux-fsdevel,
	Linux Memory Management List


Andrew,

This series aims to reduce possible pageout() calls by making the flusher
concentrate a bit more on old/expired dirty inodes.

Patches 04 and 05 have been updated since the last post; please review.
The concerns raised in the last review have been addressed.

The series runs fine on simple workloads over ext3/4, xfs, btrfs and NFS.

Trond, will you take the last patch? The "bug" it fixes has no real impact for now.

make dirty expire time a moving target
        [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
        [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
        [PATCH 3/6] writeback: sync expired inodes first in background writeback

loop condition fixes (the most tricky part)
        [PATCH 4/6] writeback: introduce writeback_control.inodes_cleaned
        [PATCH 5/6] writeback: try more writeback as long as something was written

NFS fix
        [PATCH 6/6] NFS: return -EAGAIN when skipped commit in nfs_commit_unstable_pages()

Thanks,
Fengguang



* [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
From: Wu Fengguang @ 2011-04-19  3:00 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

[-- Attachment #1: writeback-pass-wbc-to-queue_io.patch --]
[-- Type: text/plain, Size: 2526 bytes --]

This prepares for moving the dirty expire policy into move_expired_inodes().
There is no behavior change.
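
For reference, a rough sketch of how a kupdate-style caller sets up the
cutoff that move_expired_inodes() consumes before this change
(illustrative only, based on the pre-patch convention; variable names
are approximate):

	unsigned long oldest_jif;
	struct writeback_control wbc = {
		.for_kupdate	= 1,
		.sync_mode	= WB_SYNC_NONE,
	};

	if (wbc.for_kupdate) {
		/* cutoff frozen at work start: "expired" means dirtied
		 * more than dirty_expire_interval centisecs before now */
		oldest_jif = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
		wbc.older_than_this = &oldest_jif;
	}
	/* wbc then flows down through queue_io() to move_expired_inodes() */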

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:28.000000000 +0800
@@ -251,8 +251,8 @@ static bool inode_dirtied_after(struct i
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
 static void move_expired_inodes(struct list_head *delaying_queue,
-			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+				struct list_head *dispatch_queue,
+				struct writeback_control *wbc)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
@@ -262,8 +262,8 @@ static void move_expired_inodes(struct l
 
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
-		if (older_than_this &&
-		    inode_dirtied_after(inode, *older_than_this))
+		if (wbc->older_than_this &&
+		    inode_dirtied_after(inode, *wbc->older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -299,11 +299,11 @@ static void move_expired_inodes(struct l
  *                                           |
  *                                           +--> dequeue for IO
  */
-static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
 {
 	assert_spin_locked(&inode_wb_list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -579,7 +579,7 @@ void writeback_inodes_wb(struct bdi_writ
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_wb_list_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -606,7 +606,7 @@ static void __writeback_inodes_sb(struct
 
 	spin_lock(&inode_wb_list_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_wb_list_lock);
 }




* [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
From: Wu Fengguang @ 2011-04-19  3:00 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Itaru Kitayama, Wu Fengguang,
	Dave Chinner, Trond Myklebust, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

[-- Attachment #1: writeback-moving-target-dirty-expired.patch --]
[-- Type: text/plain, Size: 1996 bytes --]

Dynamically compute the dirty expire timestamp at queue_io() time.

writeback_control.older_than_this used to be determined on entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some
old busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  delaying the expired dirty pages to the end of the LRU lists,
  triggering the evil pageout(). Nevertheless this patch merely
  addresses part of the problem.
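
To illustrate the staleness (a sketch under the patch's own definitions,
not a literal excerpt): recomputing the cutoff inside
move_expired_inodes() keeps it at "now - expire_interval", so the set of
expired inodes keeps advancing while the work runs:

	/*
	 * An inode dirtied at time T expires at T + expire_interval.
	 * With a cutoff frozen at work start (t0), any inode dirtied
	 * after t0 - expire_interval never expires for this work, no
	 * matter how long it runs.  The moving target below does not
	 * have that problem.
	 */
	older_than_this = jiffies - expire_interval;	/* "now", per queue_io() */
	if (inode_dirtied_after(inode, older_than_this))
		break;	/* dirtied within expire_interval: not yet expired */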

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:28.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:29.000000000 +0800
@@ -254,16 +254,23 @@ static void move_expired_inodes(struct l
 				struct list_head *dispatch_queue,
 				struct writeback_control *wbc)
 {
+	unsigned long expire_interval = 0;
+	unsigned long older_than_this;
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	if (wbc->for_kupdate) {
+		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
+		older_than_this = jiffies - expire_interval;
+	}
+
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
-		if (wbc->older_than_this &&
-		    inode_dirtied_after(inode, *wbc->older_than_this))
+		if (expire_interval &&
+		    inode_dirtied_after(inode, older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;




* [PATCH 3/6] writeback: sync expired inodes first in background writeback
From: Wu Fengguang @ 2011-04-19  3:00 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

[-- Attachment #1: writeback-expired-for-background.patch --]
[-- Type: text/plain, Size: 3061 bytes --]

A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is (sketched right below):
- enqueue all newly expired inodes at each queue_io() time
- enqueue all dirty inodes if there are no more expired inodes to sync
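
A minimal sketch of that policy (illustrative, condensed from the diff
below; nothing_moved_yet stands in for the list_empty() checks on
dispatch_queue and tmp):

	while (!list_empty(delaying_queue)) {
		inode = wb_inode(delaying_queue->prev);
		if (expire_interval &&
		    inode_dirtied_after(inode, older_than_this)) {
			if (wbc->for_background && nothing_moved_yet) {
				expire_interval = 0;	/* no expired inodes: take all */
				continue;
			}
			break;	/* kupdate: stop at the first fresh inode */
		}
		/* move the expired inode towards dispatch_queue ... */
	}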

This will help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So syncing
older inodes first helps reduce the number of dirty pages reached by
the page reclaim code.

Side effects: it will reduce the batch size and hence reduce
inode_wb_list_lock hold time, but it also makes the cluster-by-partition
logic in the same function less effective at reducing disk seeks.

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:29.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
@@ -255,14 +255,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long uninitialized_var(older_than_this);
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -270,8 +270,20 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
+		    inode_dirtied_after(inode, older_than_this)) {
+			/*
+			 * background writeback will start with expired inodes,
+			 * and then fresh inodes. This order helps reduce the
+			 * number of dirty pages reaching the end of LRU lists
+			 * and cause trouble to the page reclaim.
+			 */
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval = 0;
+				continue;
+			}
 			break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -585,7 +597,8 @@ void writeback_inodes_wb(struct bdi_writ
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -612,7 +625,7 @@ static void __writeback_inodes_sb(struct
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_wb_list_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_wb_list_lock);




* [PATCH 4/6] writeback: introduce writeback_control.inodes_cleaned
From: Wu Fengguang @ 2011-04-19  3:00 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

[-- Attachment #1: writeback-inodes_written.patch --]
[-- Type: text/plain, Size: 2285 bytes --]

The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happens to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but is also misinterpreted as "no progress".

So introduce writeback_control.inodes_cleaned to count the inodes that
get cleaned.  A non-zero value means some progress was made on writeback,
in which case more writeback can be tried.
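
A rough usage sketch (assuming the retry loop in wb_writeback(), as in
the diff below; the full loop-condition rework lands in patch 5):

	wbc.inodes_cleaned = 0;
	writeback_inodes_wb(wb, &wbc);
	if (wbc.nr_to_write <= 0)
		continue;	/* pages were written: try for more */
	if (wbc.inodes_cleaned)
		continue;	/* only metadata-dirty inodes cleaned: still progress */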

About v1: the initial version counted successful ->write_inode() calls.
However, that leads to busy loops for sync() over NFS, because NFS
ridiculously returns 0 (success) while at the same time redirtying the
inode.  The NFS case can be trivially fixed, but there may be more
hidden bugs in other filesystems.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    4 ++++
 include/linux/writeback.h |    1 +
 2 files changed, 5 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
@@ -473,6 +473,7 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
+			wbc->inodes_cleaned++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -736,6 +737,7 @@ static long wb_writeback(struct bdi_writ
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
+		wbc.inodes_cleaned = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -752,6 +754,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write <= 0)
 			continue;
+		if (wbc.inodes_cleaned)
+			continue;
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
--- linux-next.orig/include/linux/writeback.h	2011-04-19 10:18:17.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-04-19 10:18:30.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_cleaned;		/* # of inodes cleaned */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is




* [PATCH 5/6] writeback: try more writeback as long as something was written
From: Wu Fengguang @ 2011-04-19  3:00 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

[-- Attachment #1: writeback-background-retry.patch --]
[-- Type: text/plain, Size: 2766 bytes --]

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they may populate only a subset of the eligible inodes into b_io at
entrance time. When the queued set of inodes has all been synced, they
just return, possibly with all queued inode pages written but still
wbc.nr_to_write > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes has been completed.
So it is necessary to try another round of writeback as long as we made
some progress in this round. When there are no more eligible inodes, no
more inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.
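
A condensed sketch of the resulting loop (illustrative only;
made_progress stands in for the nr_to_write and inodes_cleaned checks
in the diff below):

	for (;;) {
		wbc.nr_to_write = write_chunk;
		wbc.inodes_cleaned = 0;
		writeback_inodes_wb(wb, &wbc);	/* requeues newly expired inodes */
		if (made_progress)
			continue;	/* pages written or inodes cleaned */
		if (!wbc.more_io)
			break;		/* no more inodes for IO: bail */
		/* nothing written: wait for an inode to become writable */
	}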

Jan raised the concern

	I'm just afraid that in some pathological cases this could
	result in bad writeback pattern - like if there is some process
	which manages to dirty just a few pages while we are doing
	writeout, this looping could result in writing just a few pages
	in each round which is bad for fragmentation etc.

However, it requires really precise timing to make that happen
(continuously).  In practice it's very hard to produce such a pattern
even if it's possible in theory. I actually tried to write 1 page per
1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

while running sync(1) at the same time. The sync completes quickly on
ext4, xfs and btrfs. Readers could try other write-and-sleep patterns
and check whether they can block sync for a longer time.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:31.000000000 +0800
@@ -750,23 +750,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * Dirty inodes are moved to b_io for writeback in batches.
+		 * The completion of the current batch does not necessarily
+		 * mean the overall work is done. So we keep looping as long
+		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)
 			break;
 		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.




* [PATCH 6/6] NFS: return -EAGAIN when skipped commit in nfs_commit_unstable_pages()
From: Wu Fengguang @ 2011-04-19  3:00 UTC
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Trond Myklebust, Wu Fengguang,
	Dave Chinner, Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

[-- Attachment #1: nfs-fix-write_inode-retval.patch --]
[-- Type: text/plain, Size: 701 bytes --]

It's probably not sane to return success while redirtying the inode at
the same time in ->write_inode().
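
For context, a minimal sketch of the pattern being changed (illustrative
only, not the actual NFS source; worth_committing() and
commit_unstable_pages() are hypothetical stand-ins):

	static int example_write_inode(struct inode *inode,
				       struct writeback_control *wbc)
	{
		if (wbc->sync_mode == WB_SYNC_NONE &&
		    !worth_committing(inode)) {
			mark_inode_dirty(inode);	/* inode stays dirty... */
			return 0;	/* ...yet 0 claims success: becomes -EAGAIN */
		}
		/* otherwise actually commit the unstable pages */
		return commit_unstable_pages(inode, wbc);
	}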

CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/nfs/write.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/fs/nfs/write.c	2011-04-19 10:18:16.000000000 +0800
+++ linux-next/fs/nfs/write.c	2011-04-19 10:18:32.000000000 +0800
@@ -1519,7 +1519,7 @@ static int nfs_commit_unstable_pages(str
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int flags = FLUSH_SYNC;
-	int ret = 0;
+	int ret = -EAGAIN;
 
 	if (wbc->sync_mode == WB_SYNC_NONE) {
 		/* Don't commit yet if this is a non-blocking flush and there




* Re: [PATCH 6/6] NFS: return -EAGAIN when skipped commit in nfs_commit_unstable_pages()
  2011-04-19  3:00   ` Wu Fengguang
@ 2011-04-19  3:29     ` Trond Myklebust
  -1 siblings, 0 replies; 135+ messages in thread
From: Trond Myklebust @ 2011-04-19  3:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue, 2011-04-19 at 11:00 +0800, Wu Fengguang wrote:
> plain text document attachment (nfs-fix-write_inode-retval.patch)
> It's probably not sane to return success while redirtying the inode at
> the same time in ->write_inode().
> 
> CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/nfs/write.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-next.orig/fs/nfs/write.c	2011-04-19 10:18:16.000000000 +0800
> +++ linux-next/fs/nfs/write.c	2011-04-19 10:18:32.000000000 +0800
> @@ -1519,7 +1519,7 @@ static int nfs_commit_unstable_pages(str
>  {
>  	struct nfs_inode *nfsi = NFS_I(inode);
>  	int flags = FLUSH_SYNC;
> -	int ret = 0;
> +	int ret = -EAGAIN;
>  
>  	if (wbc->sync_mode == WB_SYNC_NONE) {
>  		/* Don't commit yet if this is a non-blocking flush and there
> 
> 

Hi Fengguang,

I don't understand the purpose of this patch...

Currently, the value of 'ret' only affects the case where the commit
exits early due to this being a non-blocking flush where we have not yet
written back enough pages to make it worth our while to send a commit.

In essence, this really only matters for the cases where someone calls
'write_inode_now' (not used by anybody calling into the NFS client) and
'sync_inode', which is only called by nfs_wb_all (with sync_mode =
WB_SYNC_ALL).

So can you please elaborate on the possible use cases for this change?

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 135+ messages in thread
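
Trond's point is easier to see in a reduced model of
nfs_commit_unstable_pages(): the non-blocking early-out is the only
path on which the initial value of 'ret' survives to the caller. This
is a simplified standalone sketch, with the skip heuristic made up for
illustration, not the actual NFS code:

#include <stdio.h>

enum sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

/* Reduced model: only the early return exposes the initial 'ret'
 * (0 before the patch, -EAGAIN after).  The skip condition is an
 * illustrative placeholder for the real "not enough commit-ready
 * pages" heuristic. */
static int commit_unstable(enum sync_mode mode, long commit_ready,
			   long total_pages, int initial_ret)
{
	int ret = initial_ret;

	if (mode == WB_SYNC_NONE) {
		if (commit_ready <= total_pages / 2)
			return ret;	/* skip commit, inode stays dirty */
	}
	/* Otherwise a COMMIT is sent and its result (modelled as
	 * success) replaces ret. */
	return 0;
}

int main(void)
{
	/* -11 == -EAGAIN on Linux */
	printf("before: %d\n", commit_unstable(WB_SYNC_NONE, 1, 100, 0));
	printf("after:  %d\n", commit_unstable(WB_SYNC_NONE, 1, 100, -11));
	return 0;
}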

* Re: [PATCH 6/6] NFS: return -EAGAIN when skipped commit in nfs_commit_unstable_pages()
  2011-04-19  3:29     ` Trond Myklebust
@ 2011-04-19  3:55       ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-19  3:55 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

Hi Trond,

On Tue, Apr 19, 2011 at 11:29:07AM +0800, Trond Myklebust wrote:
> On Tue, 2011-04-19 at 11:00 +0800, Wu Fengguang wrote:
> > plain text document attachment (nfs-fix-write_inode-retval.patch)
> > It's probably not sane to return success while redirtying the inode at
> > the same time in ->write_inode().
> > 
> > CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/nfs/write.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- linux-next.orig/fs/nfs/write.c	2011-04-19 10:18:16.000000000 +0800
> > +++ linux-next/fs/nfs/write.c	2011-04-19 10:18:32.000000000 +0800
> > @@ -1519,7 +1519,7 @@ static int nfs_commit_unstable_pages(str
> >  {
> >  	struct nfs_inode *nfsi = NFS_I(inode);
> >  	int flags = FLUSH_SYNC;
> > -	int ret = 0;
> > +	int ret = -EAGAIN;
> >  
> >  	if (wbc->sync_mode == WB_SYNC_NONE) {
> >  		/* Don't commit yet if this is a non-blocking flush and there
> > 
> > 
> 
> Hi Fengguang,
> 
> I don't understand the purpose of this patch...
> 
> Currently, the value of 'ret' only affects the case where the commit
> exits early due to this being a non-blocking flush where we have not yet
> written back enough pages to make it worth our while to send a commit.
> 
> In essence, this really only matters for the cases where someone calls
> 'write_inode_now' (not used by anybody calling into the NFS client) and
> 'sync_inode', which is only called by nfs_wb_all (with sync_mode =
> WB_SYNC_ALL).
> 
> So can you please elaborate on the possible use cases for this change?

Yeah, it has no real impact on the current kernel. The "fix" is just to
make it behave more in line with my expectations.

It did lead to a sync() hang bug with the v1 patch 4/6 in this series,
where I used the code below and expected "write_inode() == 0" to mean
"done with the inode", only to find that this is not the case for NFS.

Thanks,
Fengguang
---

@@ -389,6 +389,8 @@ writeback_single_inode(struct inode *ino
                int err = write_inode(inode, wbc);
                if (ret == 0)
                        ret = err;
+               if (!err)
+                       wbc->inodes_written++;
        }
       
        spin_lock(&inode_lock);
@@ -664,6 +667,8 @@ static long wb_writeback(struct bdi_writ
                 */
                if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
                        continue;
+               if (wbc.inodes_written)
+                       continue;
               
                /*
                 * Nothing written and no more inodes for IO, bail


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 0/6] writeback: moving expire targets for background/kupdate works
  2011-04-19  3:00 ` Wu Fengguang
@ 2011-04-19  6:38   ` Dave Chinner
  -1 siblings, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-19  6:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 19, 2011 at 11:00:03AM +0800, Wu Fengguang wrote:
> 
> Andrew,
> 
> This aims to reduce possible pageout() calls by making the flusher
> concentrate a bit more on old/expired dirty inodes.

In what situation is this a problem? Can you demonstrate how you
trigger it? And then how much improvement does this patchset make?

> Patches 04, 05 have been updated since last post, please review.
> The concerns from last review have been addressed.
> 
> It runs fine on simple workloads over ext3/4, xfs, btrfs and NFS.

But it starts propagating new differences between background and
kupdate style writeback. We've been trying to reduce the number of
permutations of writeback behaviour, so it seems to me to be wrong
to further increase the behavioural differences. Indeed, why do we
need "for kupdate" style writeback and "background" writeback
anymore - can't we just use background style writeback for both?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2011-04-19  3:00   ` Wu Fengguang
@ 2011-04-19  7:02     ` Dave Chinner
  -1 siblings, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-19  7:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Mel Gorman, Itaru Kitayama,
	Trond Myklebust, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 19, 2011 at 11:00:05AM +0800, Wu Fengguang wrote:
> Dynamically compute the dirty expire timestamp at queue_io() time.
> 
> writeback_control.older_than_this used to be determined at entrance to
> the kupdate writeback work. This _static_ timestamp may go stale if the
> kupdate work runs on and on. The flusher may then get stuck with some old
> busy inodes, never considering newly expired inodes thereafter.
> 
> This has two possible problems:
> 
> - It is unfair for a large dirty inode to delay (for a long time) the
>   writeback of small dirty inodes.
> 
> - As time goes by, the large and busy dirty inode may contain only
>   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
>   delaying the expired dirty pages to the end of LRU lists, triggering
>   the evil pageout(). Nevertheless this patch merely addresses part
>   of the problem.

When wb_writeback() is called with for_kupdate set, it initialises
wbc->older_than_this appropriately outside the writeback loop.
queue_io() is called once per writeback_inodes_wb() call, which is
once per loop in wb_writeback. All your change does is re-initialise
older_than_this once per loop in wb_writeback, just in a different
and very non-obvious place.

So why didn't you just re-initialise it inside the loop in
wb_writeback() and leave all the other code alone?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread
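
The behavioural difference under discussion - a static versus a moving
expire cutoff - can be illustrated with a standalone sketch; the
timestamps are arbitrary made-up "jiffies", not kernel values:

#include <stdbool.h>
#include <stdio.h>

/* An inode is queued once its dirtied_when timestamp falls behind the
 * cutoff.  A static cutoff taken when the work starts never picks up
 * inodes that expire while the work keeps running; a cutoff recomputed
 * each pass (the moving target) eventually does. */
static bool expired(unsigned long dirtied_when, unsigned long cutoff)
{
	return (long)(cutoff - dirtied_when) >= 0;
}

int main(void)
{
	const unsigned long expire_interval = 3000;	/* "30s" */
	const unsigned long work_start = 100000;
	const unsigned long dirtied_when = work_start - 1000; /* "10s" old */
	const unsigned long static_cutoff = work_start - expire_interval;

	for (unsigned long now = work_start; now <= work_start + 4000;
	     now += 1000) {
		unsigned long moving_cutoff = now - expire_interval;
		printf("t=%5lu  static:%d  moving:%d\n", now - work_start,
		       expired(dirtied_when, static_cutoff),
		       expired(dirtied_when, moving_cutoff));
	}
	return 0;	/* moving flips to 1 once the inode expires */
}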

* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2011-04-19  7:02     ` Dave Chinner
@ 2011-04-19  7:20       ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-19  7:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Mel Gorman, Itaru Kitayama,
	Trond Myklebust, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 19, 2011 at 03:02:47PM +0800, Dave Chinner wrote:
> On Tue, Apr 19, 2011 at 11:00:05AM +0800, Wu Fengguang wrote:
> > Dynamically compute the dirty expire timestamp at queue_io() time.
> > 
> > writeback_control.older_than_this used to be determined at entrance to
> > the kupdate writeback work. This _static_ timestamp may go stale if the
> > kupdate work runs on and on. The flusher may then get stuck with some old
> > busy inodes, never considering newly expired inodes thereafter.
> > 
> > This has two possible problems:
> > 
> > - It is unfair for a large dirty inode to delay (for a long time) the
> >   writeback of small dirty inodes.
> > 
> > - As time goes by, the large and busy dirty inode may contain only
> >   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
> >   delaying the expired dirty pages to the end of LRU lists, triggering
> >   the evil pageout(). Nevertheless this patch merely addresses part
> >   of the problem.
> 
> When wb_writeback() is called with for_kupdate set, it initialises
> wbc->older_than_this appropriately outside the writeback loop.
> queue_io() is called once per writeback_inodes_wb() call, which is
> once per loop in wb_writeback. All your change does is re-initialise
> older_than_this once per loop in wb_writeback, just in a different
> and very non-obvious place.
> 
> So why didn't you just re-initialise it inside the loop in
> wb_writeback() and leave all the other code alone?

It helps both readability and efficiency to make it a local var.

I have another patch to kill the wbc->older_than_this (and one more
for wbc->more_io). They are delayed to avoid possible merge conflicts
with the IO-less patchset.

But yeah, it seems reasonable to move the first chunk of the below
patch to this one.

Thanks,
Fengguang
---
Subject: writeback: the kupdate expire timestamp should be a moving target
Date: Wed Jul 21 20:32:30 CST 2010

Remove writeback_control.older_than_this which is no longer used.

[kitayama@cl.bb4u.ne.jp] fix btrfs and ext4 references

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 --
 fs/fs-writeback.c                |   13 -------------
 include/linux/writeback.h        |    2 --
 include/trace/events/writeback.h |    6 +-----
 mm/backing-dev.c                 |    1 -
 5 files changed, 1 insertion(+), 23 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-18 08:37:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-18 08:38:16.000000000 +0800
@@ -681,30 +681,20 @@ static unsigned long writeback_chunk_siz
  * Try to run once per dirty_writeback_interval.  But if a writeback event
  * takes longer than a dirty_writeback_interval interval, then leave a
  * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
  */
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
-		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.range_cyclic		= work->range_cyclic,
 	};
-	unsigned long oldest_jif;
 	long wrote = 0;
 	long write_chunk;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -1139,9 +1129,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  * Write out a superblock's list of dirty inodes.  A wait will be performed
  * upon no inodes, all inodes or the final one, depending upon sync_mode.
  *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
  * If `bdi' is non-zero then we're being asked to writeback a specific queue.
  * This function assumes that the blockdev superblock's inodes are backed by
  * a variety of queues, so all inodes are searched.  For other superblocks,
--- linux-next.orig/include/linux/writeback.h	2011-04-18 08:36:59.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-04-18 08:38:16.000000000 +0800
@@ -66,8 +66,6 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	unsigned long wb_start;         /* Time writeback_inodes_wb was
 					   called. This is needed to avoid
 					   extra jobs and livelock */
--- linux-next.orig/include/trace/events/writeback.h	2011-04-18 08:36:59.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-04-18 08:38:16.000000000 +0800
@@ -115,7 +115,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
 		__field(int, more_io)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -130,14 +129,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
 		__entry->more_io	= wbc->more_io;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d more=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -148,7 +145,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
 		__entry->more_io,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/mm/backing-dev.c	2011-04-18 08:36:59.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-04-18 08:38:16.000000000 +0800
@@ -263,7 +263,6 @@ static void bdi_flush_io(struct backing_
 {
 	struct writeback_control wbc = {
 		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 		.nr_to_write		= 1024,
 	};
--- linux-next.orig/fs/btrfs/extent_io.c	2011-04-18 08:36:59.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-04-18 08:38:16.000000000 +0800
@@ -2556,7 +2556,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2589,7 +2588,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-19  3:00   ` Wu Fengguang
@ 2011-04-19  7:35     ` Dave Chinner
  -1 siblings, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-19  7:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote:
> A background flush work may run forever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is
> - enqueue all newly expired inodes at each queue_io() time
> - enqueue all dirty inodes if there are no more expired inodes to sync
> 
> This will help reduce the number of dirty pages encountered by page
> reclaim, e.g. the pageout() calls. Normally older inodes contain older
> dirty pages, which are closer to the end of the LRU lists. So
> syncing older inodes first helps reduce the dirty pages reached by
> the page reclaim code.

Once again I think this is the wrong place to be changing writeback
policy decisions. for_background writeback only goes through
wb_writeback() and writeback_inodes_wb() (same as for_kupdate
writeback), so a decision to change from expired inodes to fresh
inodes, IMO, should be made in wb_writeback.

That is, for_background and for_kupdate writeback start with the
same policy (older_than_this set) to writeback expired inodes first,
then when background writeback runs out of expired inodes, it should
switch to all remaining inodes by clearing older_than_this instead
of refreshing it for the next loop.

This keeps all the policy decisions in the one place, all using the
same (existing) mechanism, and all relatively simple to understand,
and easy to tracepoint for debugging.  Changing writeback policy
deep in the writeback stack is not a good idea as it will make
extending writeback policies in future (e.g. for cgroup awareness)
very messy.

> @@ -585,7 +597,8 @@ void writeback_inodes_wb(struct bdi_writ
>  	if (!wbc->wb_start)
>  		wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_wb_list_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +
> +	if (list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
> @@ -612,7 +625,7 @@ static void __writeback_inodes_sb(struct
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
>  	spin_lock(&inode_wb_list_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +	if (list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  	writeback_sb_inodes(sb, wb, wbc, true);
>  	spin_unlock(&inode_wb_list_lock);

That changes the order in which we queue inodes for writeback.
Instead of calling every time to move b_more_io inodes onto the b_io
list and expiring more aged inodes, we only ever do it when the list
is empty. That is, it seems to me that this will tend to give
b_more_io inodes a smaller share of writeback because they are being
moved back to the b_io list less frequently where there are lots of
other inodes being dirtied. Have you tested the impact of this
change on mixed workload performance? Indeed, can you starve
writeback of a large file simply by creating lots of small files in
another thread?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread
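
For illustration, the policy switch Dave suggests - sync expired
inodes first, then clear the cutoff and take everything else - might
be modelled like this (a standalone sketch, not kernel code):

#include <stdbool.h>
#include <stdio.h>

struct minode { unsigned long dirtied_when; bool clean; };

/* One queue_io()-like pass: queue (and here, immediately "clean")
 * every dirty inode, optionally restricted to those dirtied before
 * the cutoff. */
static int queue_pass(struct minode *in, int n, bool use_cutoff,
		      unsigned long cutoff)
{
	int queued = 0;
	for (int i = 0; i < n; i++) {
		if (in[i].clean)
			continue;
		if (use_cutoff && in[i].dirtied_when > cutoff)
			continue;		/* not yet expired */
		in[i].clean = true;
		queued++;
	}
	return queued;
}

int main(void)
{
	struct minode inodes[] = {
		{ 100, false }, { 200, false },		/* expired */
		{ 900, false }, { 950, false },		/* fresh */
	};
	int n = sizeof(inodes) / sizeof(inodes[0]);
	bool use_cutoff = true;			/* expired-first policy */

	for (;;) {
		int queued = queue_pass(inodes, n, use_cutoff, 500);
		printf("pass: queued %d (cutoff %s)\n",
		       queued, use_cutoff ? "on" : "off");
		if (queued)
			continue;
		if (!use_cutoff)
			break;			/* nothing left at all */
		use_cutoff = false;		/* switch: take the rest */
	}
	return 0;
}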

* Re: [PATCH 0/6] writeback: moving expire targets for background/kupdate works
  2011-04-19  6:38   ` Dave Chinner
@ 2011-04-19  8:02     ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-19  8:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, Andrew Morton, Jan Kara, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Tue, Apr 19, 2011 at 02:38:23PM +0800, Dave Chinner wrote:
> On Tue, Apr 19, 2011 at 11:00:03AM +0800, Wu Fengguang wrote:
> > 
> > Andrew,
> > 
> > This aims to reduce possible pageout() calls by making the flusher
> > concentrate a bit more on old/expired dirty inodes.
> 
> In what situation is this a problem? Can you demonstrate how you
> trigger it? And then how much improvement does this patchset make?

As Mel put it, "it makes sense to write old pages first to reduce the
chances page reclaim is initiating IO."

In last year's LSF, Rik presented the situation with a graph:

LRU head                                 [*] dirty page
[                          *              *      * *  *  * * * * * *]

Ideally, most dirty pages should lie close to the LRU tail instead of
LRU head. That requires the flusher thread to sync old/expired inodes
first (as there are obvious correlations between inode age and page
age), and to give fair opportunities to newly expired inodes rather
than sticking with some large eldest inodes (as larger inodes have
weaker correlation between inode and page ages).

This patchset helps the flusher to meet both the above requirements.

The measurable improvements will depend a lot on the workload.  Mel
once did some tests and observed it to help (but not as large as his
forward flush patches ;)

https://lkml.org/lkml/2010/7/28/124

> > Patches 04, 05 have been updated since last post, please review.
> > The concerns from last review have been addressed.
> > 
> > It runs fine on simple workloads over ext3/4, xfs, btrfs and NFS.
> 
> But it starts propagating new differences between background and
> kupdate style writeback. We've been trying to reduce the number of
> permutations of writeback behaviour, so it seems to me to be wrong
> to further increase the behavioural differences. Indeed, why do we
> need "for kupdate" style writeback and "background" writeback
> anymore - can' we just use background style writeback for both?

This patchset actually brings the background work's semantics/behavior
closer to the kupdate work.

The two types of work have different termination rules: one is the 30s
dirty expire time, another is the background_thresh in number of dirty
pages. So they have to be treated differently when selecting the inodes
to sync.

This "if" could possibly be eliminated later, but should be done
carefully in an independent patch, preferably after this patchset is
confirmed to work reliably upstream.

-       if (wbc->for_kupdate || wbc->for_background) {
                expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
                older_than_this = jiffies - expire_interval;
-       }

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread
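
The two termination rules Wu contrasts can be stated compactly; the
thresholds below are made-up numbers for illustration, not kernel
defaults:

#include <stdbool.h>
#include <stdio.h>

/* kupdate work stops once no inode has been dirty longer than the
 * expire interval; background work stops once the dirty page count
 * drops below background_thresh. */
static bool kupdate_done(unsigned long oldest_dirty_age,
			 unsigned long expire_interval)
{
	return oldest_dirty_age < expire_interval;
}

static bool background_done(unsigned long nr_dirty,
			    unsigned long background_thresh)
{
	return nr_dirty <= background_thresh;
}

int main(void)
{
	printf("kupdate done:    %d\n", kupdate_done(2000, 3000));    /* 1 */
	printf("background done: %d\n", background_done(5000, 4000)); /* 0 */
	return 0;
}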

* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2011-04-19  7:20       ` Wu Fengguang
@ 2011-04-19  9:31         ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-19  9:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Andrew Morton, Jan Kara, Mel Gorman, Mel Gorman,
	Itaru Kitayama, Trond Myklebust, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Tue 19-04-11 15:20:40, Wu Fengguang wrote:
> On Tue, Apr 19, 2011 at 03:02:47PM +0800, Dave Chinner wrote:
> > On Tue, Apr 19, 2011 at 11:00:05AM +0800, Wu Fengguang wrote:
> > > Dynamically compute the dirty expire timestamp at queue_io() time.
> > > 
> > > writeback_control.older_than_this used to be determined at entrance to
> > > the kupdate writeback work. This _static_ timestamp may go stale if the
> > > kupdate work runs on and on. The flusher may then get stuck with some old
> > > busy inodes, never considering newly expired inodes thereafter.
> > > 
> > > This has two possible problems:
> > > 
> > > - It is unfair for a large dirty inode to delay (for a long time) the
> > >   writeback of small dirty inodes.
> > > 
> > > - As time goes by, the large and busy dirty inode may contain only
> > >   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
> > >   delaying the expired dirty pages to the end of LRU lists, triggering
> > >   the evil pageout(). Nevertheless this patch merely addresses part
> > >   of the problem.
> > 
> > When wb_writeback() is called with for_kupdate set, it initialises
> > wbc->older_than_this appropriately outside the writeback loop.
> > queue_io() is called once per writeback_inodes_wb() call, which is
> > once per loop in wb_writeback. All your change does is re-initialise
> > older_than_this once per loop in wb_writeback, just in a different
> > and very non-obvious place.
> > 
> > So why didn't you just re-initialise it inside the loop in
> > wb_writeback() and leave all the other code alone?
> 
> It helps both readability and efficiency to make it a local var.
> 
> I have another patch to kill the wbc->older_than_this (and one more
> for wbc->more_io). They are delayed to avoid possible merge conflicts
> with the IO-less patchset.
> 
> But yeah, it seems reasonable to move the first chunk of the below
> patch to this one.
  I agree - killing of wbc.older_than_this would be a logical part of
this patch as well.

								Honza
> ---
> Subject: writeback: the kupdate expire timestamp should be a moving target
> Date: Wed Jul 21 20:32:30 CST 2010
> 
> Remove writeback_control.older_than_this which is no longer used.
> 
> [kitayama@cl.bb4u.ne.jp] fix btrfs and ext4 references
> 
> Acked-by: Jan Kara <jack@suse.cz>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/btrfs/extent_io.c             |    2 --
>  fs/fs-writeback.c                |   13 -------------
>  include/linux/writeback.h        |    2 --
>  include/trace/events/writeback.h |    6 +-----
>  mm/backing-dev.c                 |    1 -
>  5 files changed, 1 insertion(+), 23 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-04-18 08:37:01.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-04-18 08:38:16.000000000 +0800
> @@ -681,30 +681,20 @@ static unsigned long writeback_chunk_siz
>   * Try to run once per dirty_writeback_interval.  But if a writeback event
>   * takes longer than a dirty_writeback_interval interval, then leave a
>   * one-second gap.
> - *
> - * older_than_this takes precedence over nr_to_write.  So we'll only write back
> - * all dirty pages if they are all attached to "old" mappings.
>   */
>  static long wb_writeback(struct bdi_writeback *wb,
>  			 struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> -		.older_than_this	= NULL,
>  		.for_kupdate		= work->for_kupdate,
>  		.for_background		= work->for_background,
>  		.range_cyclic		= work->range_cyclic,
>  	};
> -	unsigned long oldest_jif;
>  	long wrote = 0;
>  	long write_chunk;
>  	struct inode *inode;
>  
> -	if (wbc.for_kupdate) {
> -		wbc.older_than_this = &oldest_jif;
> -		oldest_jif = jiffies -
> -				msecs_to_jiffies(dirty_expire_interval * 10);
> -	}
>  	if (!wbc.range_cyclic) {
>  		wbc.range_start = 0;
>  		wbc.range_end = LLONG_MAX;
> @@ -1139,9 +1129,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   * Write out a superblock's list of dirty inodes.  A wait will be performed
>   * upon no inodes, all inodes or the final one, depending upon sync_mode.
>   *
> - * If older_than_this is non-NULL, then only write out inodes which
> - * had their first dirtying at a time earlier than *older_than_this.
> - *
>   * If `bdi' is non-zero then we're being asked to writeback a specific queue.
>   * This function assumes that the blockdev superblock's inodes are backed by
>   * a variety of queues, so all inodes are searched.  For other superblocks,
> --- linux-next.orig/include/linux/writeback.h	2011-04-18 08:36:59.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-04-18 08:38:16.000000000 +0800
> @@ -66,8 +66,6 @@ enum writeback_sync_modes {
>   */
>  struct writeback_control {
>  	enum writeback_sync_modes sync_mode;
> -	unsigned long *older_than_this;	/* If !NULL, only write back inodes
> -					   older than this */
>  	unsigned long wb_start;         /* Time writeback_inodes_wb was
>  					   called. This is needed to avoid
>  					   extra jobs and livelock */
> --- linux-next.orig/include/trace/events/writeback.h	2011-04-18 08:36:59.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2011-04-18 08:38:16.000000000 +0800
> @@ -115,7 +115,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__field(int, for_reclaim)
>  		__field(int, range_cyclic)
>  		__field(int, more_io)
> -		__field(unsigned long, older_than_this)
>  		__field(long, range_start)
>  		__field(long, range_end)
>  	),
> @@ -130,14 +129,12 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_reclaim	= wbc->for_reclaim;
>  		__entry->range_cyclic	= wbc->range_cyclic;
>  		__entry->more_io	= wbc->more_io;
> -		__entry->older_than_this = wbc->older_than_this ?
> -						*wbc->older_than_this : 0;
>  		__entry->range_start	= (long)wbc->range_start;
>  		__entry->range_end	= (long)wbc->range_end;
>  	),
>  
>  	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> -		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
> +		"bgrd=%d reclm=%d cyclic=%d more=%d "
>  		"start=0x%lx end=0x%lx",
>  		__entry->name,
>  		__entry->nr_to_write,
> @@ -148,7 +145,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_reclaim,
>  		__entry->range_cyclic,
>  		__entry->more_io,
> -		__entry->older_than_this,
>  		__entry->range_start,
>  		__entry->range_end)
>  )
> --- linux-next.orig/mm/backing-dev.c	2011-04-18 08:36:59.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-04-18 08:38:16.000000000 +0800
> @@ -263,7 +263,6 @@ static void bdi_flush_io(struct backing_
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= WB_SYNC_NONE,
> -		.older_than_this	= NULL,
>  		.range_cyclic		= 1,
>  		.nr_to_write		= 1024,
>  	};
> --- linux-next.orig/fs/btrfs/extent_io.c	2011-04-18 08:36:59.000000000 +0800
> +++ linux-next/fs/btrfs/extent_io.c	2011-04-18 08:38:16.000000000 +0800
> @@ -2556,7 +2556,6 @@ int extent_write_full_page(struct extent
>  	};
>  	struct writeback_control wbc_writepages = {
>  		.sync_mode	= wbc->sync_mode,
> -		.older_than_this = NULL,
>  		.nr_to_write	= 64,
>  		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
>  		.range_end	= (loff_t)-1,
> @@ -2589,7 +2588,6 @@ int extent_write_locked_range(struct ext
>  	};
>  	struct writeback_control wbc_writepages = {
>  		.sync_mode	= mode,
> -		.older_than_this = NULL,
>  		.nr_to_write	= nr_pages * 2,
>  		.range_start	= start,
>  		.range_end	= end + 1,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 4/6] writeback: introduce writeback_control.inodes_cleaned
  2011-04-19  3:00   ` Wu Fengguang
@ 2011-04-19  9:47     ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-19  9:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Tue 19-04-11 11:00:07, Wu Fengguang wrote:
> The flusher works on dirty inodes in batches, and may quit prematurely
> if the batch of inodes happens to be metadata-only dirtied: in this case
> wbc->nr_to_write won't be decreased at all, which stands for "no pages
> written" but is also mis-interpreted as "no progress".
> 
> So introduce writeback_control.inodes_cleaned to count the inodes that
> get cleaned.  A non-zero value means there was some progress in
> writeback, in which case more writeback can be tried.
> 
> about v1: The initial version was to count successful ->write_inode()
> calls.  However it leads to busy loops for sync() over NFS, because NFS
> ridiculously returns 0 (success) while at the same time redirties the
> inode.  The NFS case can be trivially fixed, however there may be more
> hidden bugs in other filesystems..
  OK, makes sense.
Acked-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c         |    4 ++++
>  include/linux/writeback.h |    1 +
>  2 files changed, 5 insertions(+)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
> @@ -473,6 +473,7 @@ writeback_single_inode(struct inode *ino
>  			 * No need to add it back to the LRU.
>  			 */
>  			list_del_init(&inode->i_wb_list);
> +			wbc->inodes_cleaned++;
>  		}
>  	}
>  	inode_sync_complete(inode);
> @@ -736,6 +737,7 @@ static long wb_writeback(struct bdi_writ
>  		wbc.more_io = 0;
>  		wbc.nr_to_write = write_chunk;
>  		wbc.pages_skipped = 0;
> +		wbc.inodes_cleaned = 0;
>  
>  		trace_wbc_writeback_start(&wbc, wb->bdi);
>  		if (work->sb)
> @@ -752,6 +754,8 @@ static long wb_writeback(struct bdi_writ
>  		 */
>  		if (wbc.nr_to_write <= 0)
>  			continue;
> +		if (wbc.inodes_cleaned)
> +			continue;
>  		/*
>  		 * Didn't write everything and we don't have more IO, bail
>  		 */
> --- linux-next.orig/include/linux/writeback.h	2011-04-19 10:18:17.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-04-19 10:18:30.000000000 +0800
> @@ -34,6 +34,7 @@ struct writeback_control {
>  	long nr_to_write;		/* Write this many pages, and decrement
>  					   this for each page written */
>  	long pages_skipped;		/* Pages which were not written */
> +	long inodes_cleaned;		/* # of inodes cleaned */
>  
>  	/*
>  	 * For a_ops->writepages(): is start or end are non-zero then this is
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread
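
[For reference, the wb_writeback() control flow that results from the
patch above, hand-condensed from the quoted hunks; locking, tracing and
the wait/retry tail of the loop are omitted:]

	for (;;) {
		...
		wbc.more_io = 0;
		wbc.nr_to_write = write_chunk;
		wbc.pages_skipped = 0;
		wbc.inodes_cleaned = 0;

		writeback_inodes_wb(wb, &wbc);
		wrote += write_chunk - wbc.nr_to_write;

		if (wbc.nr_to_write <= 0)
			continue;	/* pages were written: progress */
		if (wbc.inodes_cleaned)
			continue;	/* no pages, but metadata-only
					 * inodes were cleaned: progress
					 * as well */
		if (!wbc.more_io)
			break;		/* nothing written, nothing queued */
		/* else wait for an inode to become available for
		 * writeback instead of busylooping */
		...
	}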

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-19  7:35     ` Dave Chinner
@ 2011-04-19  9:57       ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-19  9:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Andrew Morton, Jan Kara, Mel Gorman, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Tue 19-04-11 17:35:23, Dave Chinner wrote:
> On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote:
> > A background flush work may run forever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> > 
> > The policy is
> > - enqueue all newly expired inodes at each queue_io() time
> > - enqueue all dirty inodes if there are no more expired inodes to sync
> > 
> > This will help reduce the number of dirty pages encountered by page
> > reclaim, e.g. the pageout() calls. Normally older inodes contain older
> > dirty pages, which are closer to the end of the LRU lists. So
> > syncing older inodes first helps reduce the number of dirty pages
> > reached by the page reclaim code.
> 
> Once again I think this is the wrong place to be changing writeback
> policy decisions. for_background writeback only goes through
> wb_writeback() and writeback_inodes_wb() (same as for_kupdate
> writeback), so a decision to change from expired inodes to fresh
> inodes, IMO, should be made in wb_writeback.
> 
> That is, for_background and for_kupdate writeback start with the
> same policy (older_than_this set) to writeback expired inodes first,
> then when background writeback runs out of expired inodes, it should
> switch to all remaining inodes by clearing older_than_this instead
> of refreshing it for the next loop.
  Yes, I agree with this and my impression is that Fengguang is trying to
achieve exactly this behavior.

> This keeps all the policy decisions in the one place, all using the
> same (existing) mechanism, and all relatively simple to understand,
> and easy to tracepoint for debugging.  Changing writeback policy
> deep in the writeback stack is not a good idea as it will make
> extending writeback policies in future (e.g. for cgroup awareness)
> very messy.
  Hmm, I see. I agree the policy decisions should be in one place if
reasonably possible. Fengguang moves them from wb_writeback() to the inode
queueing code, which looks like a logical place to me as well - there we
have the largest control over which inodes we decide to write, and we
don't have to pass all the detailed 'instructions' down in the wbc
structure. So if we later want to add cgroup awareness to writeback, I
imagine we just add the knowledge to the inode queueing code.

> > @@ -585,7 +597,8 @@ void writeback_inodes_wb(struct bdi_writ
> >  	if (!wbc->wb_start)
> >  		wbc->wb_start = jiffies; /* livelock avoidance */
> >  	spin_lock(&inode_wb_list_lock);
> > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > +
> > +	if (list_empty(&wb->b_io))
> >  		queue_io(wb, wbc);
> >  
> >  	while (!list_empty(&wb->b_io)) {
> > @@ -612,7 +625,7 @@ static void __writeback_inodes_sb(struct
> >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >  
> >  	spin_lock(&inode_wb_list_lock);
> > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > +	if (list_empty(&wb->b_io))
> >  		queue_io(wb, wbc);
> >  	writeback_sb_inodes(sb, wb, wbc, true);
> >  	spin_unlock(&inode_wb_list_lock);
> 
> That changes the order in which we queue inodes for writeback.
> Instead of calling every time to move b_more_io inodes onto the b_io
> list and expiring more aged inodes, we only ever do it when the list
> is empty. That is, it seems to me that this will tend to give
> b_more_io inodes a smaller share of writeback because they are being
> moved back to the b_io list less frequently where there are lots of
> other inodes being dirtied. Have you tested the impact of this
> change on mixed workload performance? Indeed, can you starve
> writeback of a large file simply by creating lots of small files in
> another thread?
  Yeah, this change looks suspicious to me as well.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread
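
[The policy being debated - expired inodes first, all dirty inodes once
nothing is expired - can be sketched as a queue-time decision along
these lines. A simplified illustration of the patch 3/6 approach, not
the verbatim patch; superblock sorting and the non-kupdate/background
cases are elided:]

	static void move_expired_inodes(struct list_head *delaying_queue,
					struct list_head *dispatch_queue,
					struct writeback_control *wbc)
	{
		LIST_HEAD(tmp);
		unsigned long older_than_this = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
		struct inode *inode;

		/* pass 1: move everything that has already expired */
		while (!list_empty(delaying_queue)) {
			inode = wb_inode(delaying_queue->prev);
			if (inode_dirtied_after(inode, older_than_this))
				break;	/* the rest are unexpired */
			list_move(&inode->i_wb_list, &tmp);
		}

		/* pass 2: background writeback with nothing expired
		 * falls back to all dirty inodes */
		if (list_empty(&tmp) && wbc->for_background)
			list_splice_init(delaying_queue, &tmp);

		list_splice(&tmp, dispatch_queue);
	}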

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19  3:00   ` Wu Fengguang
@ 2011-04-19 10:20     ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-19 10:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Tue 19-04-11 11:00:08, Wu Fengguang wrote:
> writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> they only populate possibly a subset of eligible inodes into b_io at
> entrance time. When the queued set of inodes has all been synced, they
> just return, possibly with all queued inode pages written but still
> wbc.nr_to_write > 0.
> 
> For kupdate and background writeback, there may be more eligible inodes
> sitting in b_dirty when the current set of b_io inodes are completed. So
> it is necessary to try another round of writeback as long as we made some
> progress in this round. When there are no more eligible inodes, no more
> inodes will be enqueued in queue_io(), hence nothing could/will be
> synced and we may safely bail.
  Let me understand your concern here: You are afraid that if we do
for_background or for_kupdate writeback and we write less than
MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
inodes to write at the time we are stopping writeback - the two realistic
cases I can think of are:
a) when inodes just freshly expired during writeback
b) when the bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are
  over the background threshold due to data on some other bdi. And then
  while we are doing writeback someone dirties pages at our bdi.
Or do you see some other case as well?

The a) case does not seem like a big issue to me after your changes to
move_expired_inodes(). The b) case maybe, but do you think it will make
any difference?

								Honza
> 
> Jan raised the concern
> 
> 	I'm just afraid that in some pathological cases this could
> 	result in bad writeback pattern - like if there is some process
> 	which manages to dirty just a few pages while we are doing
> 	writeout, this looping could result in writing just a few pages
> 	in each round which is bad for fragmentation etc.
> 
> However, it requires really strong timing to make that happen
> (continuously).  In practice it's very hard to produce such a pattern
> even if it's possible in theory. I actually tried to write 1 page per
> 1ms with this command
> 
> 	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
> 
> and do sync(1) at the same time. The sync completes quickly on ext4,
> xfs, btrfs. The readers could try other write-and-sleep patterns and
> check if it can block sync for longer time.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:31.000000000 +0800
> @@ -750,23 +750,23 @@ static long wb_writeback(struct bdi_writ
>  		wrote += write_chunk - wbc.nr_to_write;
>  
>  		/*
> -		 * If we consumed everything, see if we have more
> +		 * Did we write something? Try for more
> +		 *
> +		 * Dirty inodes are moved to b_io for writeback in batches.
> +		 * The completion of the current batch does not necessarily
> +		 * mean the overall work is done. So we keep looping as long
> +		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (wbc.nr_to_write <= 0)
> +		if (wbc.nr_to_write < write_chunk)
>  			continue;
>  		if (wbc.inodes_cleaned)
>  			continue;
>  		/*
> -		 * Didn't write everything and we don't have more IO, bail
> +		 * No more inodes for IO, bail
>  		 */
>  		if (!wbc.more_io)
>  			break;
>  		/*
> -		 * Did we write something? Try for more
> -		 */
> -		if (wbc.nr_to_write < write_chunk)
> -			continue;
> -		/*
>  		 * Nothing written. Wait for some inode to
>  		 * become available for writeback. Otherwise
>  		 * we'll just busyloop.
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19 10:20     ` Jan Kara
@ 2011-04-19 11:16       ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-19 11:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Mel Gorman, Dave Chinner, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 19, 2011 at 06:20:16PM +0800, Jan Kara wrote:
> On Tue 19-04-11 11:00:08, Wu Fengguang wrote:
> > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > they only populate possibly a subset of eligible inodes into b_io at
> > entrance time. When the queued set of inodes has all been synced, they
> > just return, possibly with all queued inode pages written but still
> > wbc.nr_to_write > 0.
> > 
> > For kupdate and background writeback, there may be more eligible inodes
> > sitting in b_dirty when the current set of b_io inodes are completed. So
> > it is necessary to try another round of writeback as long as we made some
> > progress in this round. When there are no more eligible inodes, no more
> > inodes will be enqueued in queue_io(), hence nothing could/will be
> > synced and we may safely bail.
>   Let me understand your concern here: You are afraid that if we do
> for_background or for_kupdate writeback and we write less than
> MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> inodes to write at the time we are stopping writeback - the two realistic

Yes.

> cases I can think of are:
> a) when inodes just freshly expired during writeback
> b) when the bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are
>   over the background threshold due to data on some other bdi. And then
>   while we are doing writeback someone dirties pages at our bdi.
> Or do you see some other case as well?
> 
> The a) case does not seem like a big issue to me after your changes to

Yeah (a) is not an issue with kupdate writeback.

> move_expired_inodes(). The b) case maybe, but do you think it will make
> any difference?

(b) also seems weird. What I have in mind is this for_background case.
Imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

At queue_io() time, i90-i99 happen to be expired and are moved to b_io
for IO. When they finish successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over the
background threshold.

This will be a fairly normal/frequent case, I guess.
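
To put rough numbers on it (the 1024-page MAX_WRITEBACK_PAGES chunk and
the per-inode page counts here are illustrative assumptions): suppose
i90-i99 hold 50 dirty pages each. The batch writes 10 * 50 = 500 pages
and leaves nr_to_write = 1024 - 500 = 524. The pre-series test order
sees nr_to_write > 0 (no "continue") and then more_io == 0 (so "break"),
quitting background writeback with i0-i89 still dirty; with the
reordered tests, "nr_to_write < write_chunk" (524 < 1024) fires first
and starts another round, whose queue_io() can pick up more inodes.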

Thanks,
Fengguang

> 								Honza
> > 
> > Jan raised the concern
> > 
> > 	I'm just afraid that in some pathological cases this could
> > 	result in bad writeback pattern - like if there is some process
> > 	which manages to dirty just a few pages while we are doing
> > 	writeout, this looping could result in writing just a few pages
> > 	in each round which is bad for fragmentation etc.
> > 
> > However, it requires really strong timing to make that happen
> > (continuously).  In practice it's very hard to produce such a pattern
> > even if it's possible in theory. I actually tried to write 1 page per
> > 1ms with this command
> > 
> > 	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
> > 
> > and do sync(1) at the same time. The sync completes quickly on ext4,
> > xfs, btrfs. The readers could try other write-and-sleep patterns and
> > check if it can block sync for longer time.
> > 
> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |   16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:31.000000000 +0800
> > @@ -750,23 +750,23 @@ static long wb_writeback(struct bdi_writ
> >  		wrote += write_chunk - wbc.nr_to_write;
> >  
> >  		/*
> > -		 * If we consumed everything, see if we have more
> > +		 * Did we write something? Try for more
> > +		 *
> > +		 * Dirty inodes are moved to b_io for writeback in batches.
> > +		 * The completion of the current batch does not necessarily
> > +		 * mean the overall work is done. So we keep looping as long
> > +		 * as made some progress on cleaning pages or inodes.
> >  		 */
> > -		if (wbc.nr_to_write <= 0)
> > +		if (wbc.nr_to_write < write_chunk)
> >  			continue;
> >  		if (wbc.inodes_cleaned)
> >  			continue;
> >  		/*
> > -		 * Didn't write everything and we don't have more IO, bail
> > +		 * No more inodes for IO, bail
> >  		 */
> >  		if (!wbc.more_io)
> >  			break;
> >  		/*
> > -		 * Did we write something? Try for more
> > -		 */
> > -		if (wbc.nr_to_write < write_chunk)
> > -			continue;
> > -		/*
> >  		 * Nothing written. Wait for some inode to
> >  		 * become available for writeback. Otherwise
> >  		 * we'll just busyloop.
> > 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-19  9:57       ` Jan Kara
  (?)
@ 2011-04-19 12:56       ` Wu Fengguang
  2011-04-19 13:46           ` Wu Fengguang
  2011-04-20  1:21           ` Dave Chinner
  -1 siblings, 2 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-19 12:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Andrew Morton, Mel Gorman, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

[-- Attachment #1: Type: text/plain, Size: 8147 bytes --]

On Tue, Apr 19, 2011 at 05:57:40PM +0800, Jan Kara wrote:
> On Tue 19-04-11 17:35:23, Dave Chinner wrote:
> > On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote:
> > > A background flush work may run forever. So it's reasonable for it to
> > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > 
> > > The policy is
> > > - enqueue all newly expired inodes at each queue_io() time
> > > - enqueue all dirty inodes if there are no more expired inodes to sync
> > > 
> > > This will help reduce the number of dirty pages encountered by page
> > > reclaim, e.g. the pageout() calls. Normally older inodes contain older
> > > dirty pages, which are closer to the end of the LRU lists. So
> > > syncing older inodes first helps reduce the number of dirty pages
> > > reached by the page reclaim code.
> > 
> > Once again I think this is the wrong place to be changing writeback
> > policy decisions. for_background writeback only goes through
> > wb_writeback() and writeback_inodes_wb() (same as for_kupdate
> > writeback), so a decision to change from expired inodes to fresh
> > inodes, IMO, should be made in wb_writeback.
> > 
> > That is, for_background and for_kupdate writeback start with the
> > same policy (older_than_this set) to writeback expired inodes first,
> > then when background writeback runs out of expired inodes, it should
> > switch to all remaining inodes by clearing older_than_this instead
> > of refreshing it for the next loop.
>   Yes, I agree with this and my impression is that Fengguang is trying to
> achieve exactly this behavior.
> 
> > This keeps all the policy decisions in the one place, all using the
> > same (existing) mechanism, and all relatively simple to understand,
> > and easy to tracepoint for debugging.  Changing writeback policy
> > deep in the writeback stack is not a good idea as it will make
> > extending writeback policies in future (e.g. for cgroup awareness)
> > very messy.
>   Hmm, I see. I agree the policy decisions should be in one place if
> reasonably possible. Fengguang moves them from wb_writeback() to the inode
> queueing code, which looks like a logical place to me as well - there we
> have the largest control over which inodes we decide to write, and we
> don't have to pass all the detailed 'instructions' down in the wbc
> structure. So if we later want to add cgroup awareness to writeback, I
> imagine we just add the knowledge to the inode queueing code.

I actually started with wb_writeback() as a natural choice, and then
found it much easier to do the expired-only=>all-inodes switching in
move_expired_inodes() since it needs to know the @b_dirty and @tmp
lists' emptiness to trigger the switch. It's not sane for
wb_writeback() to look into such details. And once you do the switch
part in move_expired_inodes(), the whole policy naturally follows.

> > > @@ -585,7 +597,8 @@ void writeback_inodes_wb(struct bdi_writ
> > >  	if (!wbc->wb_start)
> > >  		wbc->wb_start = jiffies; /* livelock avoidance */
> > >  	spin_lock(&inode_wb_list_lock);
> > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > +
> > > +	if (list_empty(&wb->b_io))
> > >  		queue_io(wb, wbc);
> > >  
> > >  	while (!list_empty(&wb->b_io)) {
> > > @@ -612,7 +625,7 @@ static void __writeback_inodes_sb(struct
> > >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > >  
> > >  	spin_lock(&inode_wb_list_lock);
> > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > +	if (list_empty(&wb->b_io))
> > >  		queue_io(wb, wbc);
> > >  	writeback_sb_inodes(sb, wb, wbc, true);
> > >  	spin_unlock(&inode_wb_list_lock);
> > 
> > That changes the order in which we queue inodes for writeback.
> > Instead of calling every time to move b_more_io inodes onto the b_io
> > list and expiring more aged inodes, we only ever do it when the list
> > is empty. That is, it seems to me that this will tend to give
> > b_more_io inodes a smaller share of writeback because they are being
> > moved back to the b_io list less frequently where there are lots of
> > other inodes being dirtied. Have you tested the impact of this
> > change on mixed workload performance? Indeed, can you starve
> > writeback of a large file simply by creating lots of small files in
> > another thread?
>   Yeah, this change looks suspicious to me as well.

The exact behaviors are indeed rather complex. I personally find the
new "always refill iff empty" policy more consistent, cleaner and
easier to understand.

It basically says: at each round started by a b_io refill, set up a
_fixed_ work set with all currently expired inodes (or all currently
dirtied inodes if none are expired) and walk through it. A "fixed" work
set means no new inodes will be added to the work set during the walk.
When a complete walk is done, start over with a new set of inodes that
are eligible at the time.

The figure on page 14 illustrates the "rounds" idea:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/linux-writeback-queues.pdf

This procedure provides fairness among the inodes and guarantees that
each inode is synced once and only once per round. So it's free from
starvation.
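
In code terms the round structure is just the "refill iff empty" rule
from the diff above, plus the existing rule that a partially-written
inode is requeued to b_more_io rather than back to b_io (sketch):

	if (list_empty(&wb->b_io))
		queue_io(wb, wbc);	/* new round: fix the work set */

	while (!list_empty(&wb->b_io)) {
		/* sync one inode; if it still has dirty pages it is
		 * requeued to b_more_io, never back into the current
		 * b_io set */
		...
	}

So the work set only shrinks during a walk, and every inode in it is
visited exactly once per round.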

If you are worried about performance, here is a simple tar+dd benchmark.
Both commands actually run faster with this patchset:

wfg /tmp% g cpu log-* | g dd
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 13.658 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 12.961 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 13.420 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.30s system 9% cpu 13.103 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.31s system 9% cpu 13.650 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 8% cpu 15.258 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 8% cpu 14.255 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 8% cpu 14.443 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 8% cpu 14.051 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.27s system 8% cpu 14.648 total

wfg /tmp% g cpu log-* | g tar
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.49s user 3.99s system 60% cpu 27.285 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.78s user 4.40s system 65% cpu 26.125 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.50s user 4.56s system 64% cpu 26.265 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.50s user 4.18s system 62% cpu 26.766 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.60s user 4.03s system 60% cpu 27.463 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.42s user 4.17s system 57% cpu 28.688 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.67s user 4.04s system 58% cpu 28.738 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.53s user 4.50s system 58% cpu 29.287 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.38s user 4.28s system 57% cpu 28.861 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.44s user 4.19s system 56% cpu 29.443 total

Total elapsed time (from tar/dd start to sync complete) is 244.36s
without the patch vs. 239.91s with it, i.e. also a bit faster with the
patch.

The base kernel is 2.6.39-rc3+ plus IO-less patchset plus large write
chunk size. The test box has 3G mem and runs XFS. Test script is:

#!/bin/zsh


# we are doing pure write tests
cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

umount /dev/sda7
mkfs.xfs -f /dev/sda7
mount /dev/sda7 /fs

echo 3 > /proc/sys/vm/drop_caches

echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable

cat /proc/uptime

cd /fs
time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

wait
sync
cat /proc/uptime

Thanks,
Fengguang

[-- Attachment #2: log-no-moving-expire --]
[-- Type: text/plain, Size: 5213 bytes --]

dt7, no moving target

wfg ~% s fat                                                                                                   [ 255 ]  :-(
Linux fat 2.6.39-rc3-dt7+ #235 SMP Tue Apr 19 19:33:15 CST 2011 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
No mail.
Last login: Tue Apr 19 19:16:05 2011 from 10.255.20.73
wfg@fat ~% su
root@fat /home/wfg# for i in 1 2 3 4 5; do bin/test-tar-dd.sh; sleep 3; done
umount: /dev/sda7: not mounted
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
306.70 2423.01
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 15.2306 s, 68.8 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 8% cpu 15.258 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.42s user 4.17s system 57% cpu 28.688 total
344.05 2662.47
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
351.63 2721.77
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.1873 s, 73.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 8% cpu 14.255 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.67s user 4.04s system 58% cpu 28.738 total
388.94 2963.14
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
396.53 3024.20
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.385 s, 72.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 8% cpu 14.443 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.53s user 4.50s system 58% cpu 29.287 total
434.18 3268.86
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
441.69 3327.58
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.997 s, 74.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 8% cpu 14.051 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.38s user 4.28s system 57% cpu 28.861 total
478.91 3569.24
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
486.48 3627.06
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.5851 s, 71.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.27s system 8% cpu 14.648 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.44s user 4.19s system 56% cpu 29.443 total
524.46 3871.42

3871.42 - 3627.06 = 244.36


ext4:

1855.48 14403.91
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.48s user 3.31s system 86% cpu 18.345 total
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4943 s, 51.2 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.65s system 8% cpu 20.518 total
1884.20 14562.35

14562.35 - 14403.91 = 158.44

[-- Attachment #3: log-moving-expire --]
[-- Type: text/plain, Size: 5051 bytes --]

dt7, moving target

wfg ~% s fat                                                                                                   [ 255 ]  :-(
Linux fat 2.6.39-rc3-dt7+ #234 SMP Tue Apr 19 17:23:44 CST 2011 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
No mail.
Last login: Tue Apr 19 17:25:16 2011 from 10.255.20.73
wfg@fat ~% su
root@fat /home/wfg# vi bin/test-tar-dd.sh
root@fat /home/wfg# bin/test-tar-dd.sh
umount: /dev/sda7: not mounted
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
634.16 5029.23
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.6318 s, 76.9 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 13.658 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.49s user 3.99s system 60% cpu 27.285 total
670.17 5262.84
root@fat /home/wfg# bin/test-tar-dd.sh
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
678.41 5327.07
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 12.9063 s, 81.2 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 12.961 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.78s user 4.40s system 65% cpu 26.125 total
713.93 5559.64
root@fat /home/wfg# bin/test-tar-dd.sh
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
722.54 5626.94
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.3658 s, 78.5 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 13.420 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.50s user 4.56s system 64% cpu 26.265 total
757.98 5855.34
root@fat /home/wfg# bin/test-tar-dd.sh
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
766.10 5918.93
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.0385 s, 80.4 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.30s system 9% cpu 13.103 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.50s user 4.18s system 62% cpu 26.766 total
801.72 6152.51
root@fat /home/wfg#
root@fat /home/wfg# bin/test-tar-dd.sh
meta-data=/dev/sda7              isize=256    agcount=4, agsize=6170464 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=24681856, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=12051, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
994.01 7677.81
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.5859 s, 77.2 MB/s
dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.31s system 9% cpu 13.650 total
tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.60s user 4.03s system 60% cpu 27.463 total
1030.08 7917.72

7917.72 - 7677.81 = 239.91

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-19 12:56       ` Wu Fengguang
@ 2011-04-19 13:46           ` Wu Fengguang
  2011-04-20  1:21           ` Dave Chinner
  1 sibling, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-19 13:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Andrew Morton, Mel Gorman, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

> wfg /tmp% g cpu log-* | g dd
> log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 13.658 total
> log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 12.961 total
> log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 13.420 total
> log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.30s system 9% cpu 13.103 total
> log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.31s system 9% cpu 13.650 total
> log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 8% cpu 15.258 total
> log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 8% cpu 14.255 total
> log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 8% cpu 14.443 total
> log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 8% cpu 14.051 total
> log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.27s system 8% cpu 14.648 total
> 
> wfg /tmp% g cpu log-* | g tar
> log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.49s user 3.99s system 60% cpu 27.285 total
> log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.78s user 4.40s system 65% cpu 26.125 total
> log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.50s user 4.56s system 64% cpu 26.265 total
> log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.50s user 4.18s system 62% cpu 26.766 total
> log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.60s user 4.03s system 60% cpu 27.463 total
> log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.42s user 4.17s system 57% cpu 28.688 total
> log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.67s user 4.04s system 58% cpu 28.738 total
> log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.53s user 4.50s system 58% cpu 29.287 total
> log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.38s user 4.28s system 57% cpu 28.861 total
> log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.44s user 4.19s system 56% cpu 29.443 total

Jan, here are the ext4 numbers. It's also doing better now. I also find
the behaviors, and hence the dd vs. tar numbers, are quite different
from XFS:

- XFS likes to redirty the inode after writing pages, while ext4 does not

- ext4 enforces a large 128MB write chunk size, while XFS uses an
  adaptive 24-28MB chunk size (close to half the disk bandwidth; see
  the sketch below)
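
(For the curious, the adaptive sizing in my patchset is roughly the
following shape -- a sketch only; the field and constant names here
are approximate:)

	static long writeback_chunk_size(struct backing_dev_info *bdi)
	{
		/*
		 * Target about half a second of IO per chunk: with
		 * bandwidth estimated in pages/s, that is half the
		 * measured write bandwidth.
		 */
		long pages = bdi->avg_write_bandwidth / 2;

		return clamp(pages, MIN_WRITEBACK_PAGES, MAX_WRITEBACK_PAGES);
	}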

log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.34s user 3.46s system 93% cpu 16.848 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.59s user 3.30s system 95% cpu 16.655 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.45s user 3.54s system 94% cpu 16.881 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.62s user 3.38s system 93% cpu 17.187 total
log-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.70s user 3.20s system 92% cpu 17.219 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.48s user 3.31s system 86% cpu 18.345 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.89s user 3.35s system 86% cpu 18.730 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.84s user 3.41s system 86% cpu 18.820 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.63s user 3.23s system 84% cpu 18.831 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.76s user 3.41s system 84% cpu 19.026 total
log-no-moving-expire:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.72s user 3.29s system 86% cpu 18.597 total

log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.71s system 9% cpu 19.019 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.77s system 9% cpu 19.053 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.79s system 9% cpu 19.238 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.80s system 9% cpu 19.227 total
log-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.76s system 9% cpu 19.439 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.65s system 8% cpu 20.518 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.73s system 8% cpu 20.693 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.78s system 8% cpu 20.745 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.72s system 8% cpu 20.369 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.74s system 8% cpu 20.682 total
log-no-moving-expire:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.74s system 8% cpu 20.593 total

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19 11:16       ` Wu Fengguang
@ 2011-04-19 21:10         ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-19 21:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Tue 19-04-11 19:16:01, Wu Fengguang wrote:
> On Tue, Apr 19, 2011 at 06:20:16PM +0800, Jan Kara wrote:
> > On Tue 19-04-11 11:00:08, Wu Fengguang wrote:
> > > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > > they only populate possibly a subset of eligible inodes into b_io at
> > > entrance time. When the queued set of inodes are all synced, they just
> > > return, possibly with all queued inode pages written but still
> > > wbc.nr_to_write > 0.
> > > 
> > > For kupdate and background writeback, there may be more eligible inodes
> > > sitting in b_dirty when the current set of b_io inodes are completed. So
> > > it is necessary to try another round of writeback as long as we made some
> > > progress in this round. When there are no more eligible inodes, no more
> > > inodes will be enqueued in queue_io(), hence nothing could/will be
> > > synced and we may safely bail.
> >   Let me understand your concern here: You are afraid that if we do
> > for_background or for_kupdate writeback and we write less than
> > MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> > inodes to write at the time we are stopping writeback - the two realistic
> 
> Yes.
> 
> > cases I can think of are:
> > a) when inodes just freshly expired during writeback
> > b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
> >   background threshold due to data on some other bdi. And then while we are
> >   doing writeback someone does dirtying at our bdi.
> > Or do you see some other case as well?
> > 
> > The a) case does not seem like a big issue to me after your changes to
> 
> Yeah (a) is not an issue with kupdate writeback.
> 
> > move_expired_inodes(). The b) case maybe but do you think it will make any
> > difference? 
> 
> (b) also seems weird. What I have in mind is this for_background case.
> Imagine 100 inodes
> 
>         i0, i1, i2, ..., i90, i91, ..., i99
> 
> At queue_io() time, i90-i99 happen to be expired and moved to s_io for
> IO. When finished successfully, if their total size is less than
> MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> quit the background work (w/o this patch) while it's still over
> background threshold.
> 
> This will be a fairly normal/frequent case I guess.
  Ah OK, I see. I missed this case your patch set has added. Also your
changes of
        if (!wbc->for_kupdate || list_empty(&wb->b_io))
to
	if (list_empty(&wb->b_io))
are going to cause more cases where we'd hit nr_to_write > 0 (e.g. when one
pass of b_io does not write all the inodes so some are left in b_io list
and then next call to writeback finds these inodes there but there's less
than MAX_WRITEBACK_PAGES in them). Frankly, it makes me like the above
change even less. I'd rather see writeback_inodes_wb /
__writeback_inodes_sb always work on a fresh set of inodes which is
initialized whenever we enter these functions. It just seems less
surprising to me...
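
Schematically, what I have in mind (illustration only, not a tested
patch; the shape follows the existing __writeback_inodes_sb()):

	spin_lock(&inode_wb_list_lock);
	queue_io(wb, wbc);	/* re-expire on every entry: fresh set */
	writeback_sb_inodes(sb, wb, wbc, true);
	spin_unlock(&inode_wb_list_lock);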

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-19 12:56       ` Wu Fengguang
@ 2011-04-20  1:21           ` Dave Chinner
  2011-04-20  1:21           ` Dave Chinner
  1 sibling, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-20  1:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote:
> On Tue, Apr 19, 2011 at 05:57:40PM +0800, Jan Kara wrote:
> > On Tue 19-04-11 17:35:23, Dave Chinner wrote:
> > > On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote:
> > > > A background flush work may run forever. So it's reasonable for it to
> > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > 
> > > > The policy is
> > > > - enqueue all newly expired inodes at each queue_io() time
> > > > - enqueue all dirty inodes if there are no more expired inodes to sync
> > > > 
> > > > This will help reduce the number of dirty pages encountered by page
> > > > reclaim, e.g. the pageout() calls. Normally older inodes contain older
> > > > dirty pages, which are closer to the end of the LRU lists. So
> > > > syncing older inodes first helps reduce the dirty pages reached by
> > > > the page reclaim code.
> > > 
> > > Once again I think this is the wrong place to be changing writeback
> > > policy decisions. for_background writeback only goes through
> > > wb_writeback() and writeback_inodes_wb() (same as for_kupdate
> > > writeback), so a decision to change from expired inodes to fresh
> > > inodes, IMO, should be made in wb_writeback.
> > > 
> > > That is, for_background and for_kupdate writeback start with the
> > > same policy (older_than_this set) to writeback expired inodes first,
> > > then when background writeback runs out of expired inodes, it should
> > > switch to all remaining inodes by clearing older_than_this instead
> > > of refreshing it for the next loop.
> >   Yes, I agree with this and my impression is that Fengguang is trying to
> > achieve exactly this behavior.
> > 
> > > This keeps all the policy decisions in the one place, all using the
> > > same (existing) mechanism, and all relatively simple to understand,
> > > and easy to tracepoint for debugging.  Changing writeback policy
> > > deep in the writeback stack is not a good idea as it will make
> > > extending writeback policies in future (e.g. for cgroup awareness)
> > > very messy.
> >   Hmm, I see. I agree the policy decisions should be at one place if
> > reasonably possible. Fengguang moves them from wb_writeback() to inode
> > queueing code which looks like a logical place to me as well - there we
> > have the largest control over what inodes do we decide to write and don't
> > have to pass all the detailed 'instructions' down in wbc structure. So if
> > we later want to add cgroup awareness to writeback, I imagine we just add
> > the knowledge to inode queueing code.
> 
> I actually started with wb_writeback() as a natural choice, and then
> found it much easier to do the expired-only=>all-inodes switching in
> move_expired_inodes() since it needs to know the @b_dirty and @tmp
> lists' emptiness to trigger the switch. It's not sane for
> wb_writeback() to look into such details. And once you do the switch
> part in move_expired_inodes(), the whole policy naturally follows.

Well, not really. You didn't need to modify move_expired_inodes() at
all to implement these changes - all you needed to do was modify how
older_than_this is configured.

Writeback policy is defined by the struct writeback_control.
move_expired_inodes() is pure mechanism. What you've done is remove
policy from the struct wbc and move it to move_expired_inodes(),
which now defines both policy and mechanism.

Further, this means that all the tracing that uses the struct wbc no
longer shows the entire writeback policy being worked on, so we lose
visibility into the policy decisions writeback is making.

This same change is as simple as updating wbc->older_than_this
appropriately after the wb_writeback() call for both background and
kupdate and leaving the lower layers untouched. It's just a policy
change. If you think the mechanism is inefficient, copy
wbc->older_than_this to a local variable inside
move_expired_inodes()....
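
Something like the following, say (sketch only, against your series
where move_expired_inodes() already takes the wbc):

	static void move_expired_inodes(struct list_head *delaying_queue,
					struct list_head *dispatch_queue,
					struct writeback_control *wbc)
	{
		unsigned long expire_limit = 0;

		/*
		 * Policy stays in the wbc; take a local copy so the
		 * list walk doesn't chase the pointer for every inode.
		 */
		if (wbc->older_than_this)
			expire_limit = *wbc->older_than_this;
		....
	}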

> > > > @@ -585,7 +597,8 @@ void writeback_inodes_wb(struct bdi_writ
> > > >  	if (!wbc->wb_start)
> > > >  		wbc->wb_start = jiffies; /* livelock avoidance */
> > > >  	spin_lock(&inode_wb_list_lock);
> > > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > > +
> > > > +	if (list_empty(&wb->b_io))
> > > >  		queue_io(wb, wbc);
> > > >  
> > > >  	while (!list_empty(&wb->b_io)) {
> > > > @@ -612,7 +625,7 @@ static void __writeback_inodes_sb(struct
> > > >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > > >  
> > > >  	spin_lock(&inode_wb_list_lock);
> > > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > > +	if (list_empty(&wb->b_io))
> > > >  		queue_io(wb, wbc);
> > > >  	writeback_sb_inodes(sb, wb, wbc, true);
> > > >  	spin_unlock(&inode_wb_list_lock);
> > > 
> > > That changes the order in which we queue inodes for writeback.
> > > Instead of calling every time to move b_more_io inodes onto the b_io
> > > list and expiring more aged inodes, we only ever do it when the list
> > > is empty. That is, it seems to me that this will tend to give
> > > b_more_io inodes a smaller share of writeback because they are being
> > > moved back to the b_io list less frequently where there are lots of
> > > other inodes being dirtied. Have you tested the impact of this
> > > change on mixed workload performance? Indeed, can you starve
> > > writeback of a large file simply by creating lots of small files in
> > > another thread?
> >   Yeah, this change looks suspicious to me as well.
> 
> The exact behaviors are indeed rather complex. I personally feel the
> new "always refill iff empty" policy is more consistent, cleaner and
> easier to understand.

That may be so, but that doesn't make the change good from an IO
perspective. You said you'd only done light testing, and that's not
sufficient to gauge the impact of such a change.

> It basically says: at each round started by a b_io refill, set up a
> _fixed_ work set with all currently expired inodes (or all currently
> dirty inodes if none is expired) and walk through it. "Fixed" means
> no new inodes will be added to the work set during the walk.  When a
> complete walk is done, start over with a new set of inodes that are
> eligible at the time.

Yes, I know what it does - I can read the code. You haven't, however,
answered why it is a good change from an IO perspective.

> The figure in page 14 illustrates the "rounds" idea:
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/linux-writeback-queues.pdf
> 
> This procedure provides fairness among the inodes and guarantees each
> inode to be synced once and only once at each round. So it's free from
> starvations.

Perhaps you should add some of this commentary to the commit
message? The current one talks about the VM and LRU writeback, but
that has nothing to do with writeback fairness. The commit message or
comments in the code need to explain why something is being
changed....

> 
> If you are worried about performance, here is a simple tar+dd benchmark.
> Both commands are actually running faster with this patchset:
.....
> The base kernel is 2.6.39-rc3+ plus IO-less patchset plus large write
> chunk size. The test box has 3G mem and runs XFS. Test script is:

<sigh>

The numbers are meaningless to me - you've got a large number of
other changes that are affecting writeback behaviour, and that's
especially important because, at minimum, the change in write chunk
size will hide any differences in IO patterns that this change will
make. Please test against a vanilla kernel if that is what you are
aiming these patches for. If you aren't aiming for a vanilla kernel,
please say so in the patch series header...

Anyway, I'm going to put some numbers into a hypothetical steady
state situation to demonstrate the differences in algorithms.
Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)

	writeback  8 small inodes 800 pages
		   1 large inode 224 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)
	.....

Your new code:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 8s)
	writeback  8 small inodes 800 pages

	b_io empty: (1800 pages written)
		b_more_io (large inode) -> b_io (1l)
		14 newly expired inodes -> b_io (1l, 14s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 14s)
	writeback  10 small inodes 1000 pages
		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
	writeback  5 small inodes 500 pages
	b_io empty: (2548 pages written)
		b_more_io (large inode) -> b_io (1l, 1s(24))
		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
	......

Rough progression of pages written at b_io refill:

Old code:

	total	large file	% of writeback
	1024	224		21.9% (fixed)
	
New code:
	total	large file	% of writeback
	1800	1024		~55%
	2550	1024		~40%
	3050	1024		~33%
	3500	1024		~29%
	3950	1024		~26%
	4250	1024		~24%
	4500	1024		~22.7%
	4700	1024		~21.7%
	4800	1024		~21.3%
	4800	1024		~21.3%
	(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that it doesn't change the share of
writeback to the large file should be in the commit message ;)
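
FWIW, a throwaway userspace model of the above shows the same
convergence. Simplifying assumptions (mine): every small inode is
written in full each round, and the nr_to_write boundary effects in
the trace above are ignored, so the numbers are close to, but not
exactly, the ones listed:

	#include <stdio.h>

	int main(void)
	{
		double pending = 8;	/* small inodes expired before round 1 */
		int round;

		for (round = 1; round <= 10; round++) {
			double total = 1024 + 100 * pending;

			printf("round %2d: %5.0f pages, large file %4.1f%%\n",
			       round, total, 100.0 * 1024 / total);
			/* 8 more small inodes expire per 1024 pages written */
			pending = total / 1024 * 8;
		}
		/*
		 * Settles where x = 1024 + 100 * (8x / 1024), i.e.
		 * x ~= 4681 pages per refill and a ~21.9% share for the
		 * large file - the same share the old code gives.
		 */
		return 0;
	}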

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks. I
think this will have a bigger effect on a vanilla kernel than on the
kernel you tested on above because of the smaller writeback chunk
size.

I'm convinced now that refilling only when the queue is empty is a
sane change. You need to separate this from the move_expired_inodes()
changes, though, because it is doing something very different to
writeback.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-20  1:21           ` Dave Chinner
@ 2011-04-20  2:53             ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-20  2:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Wed, Apr 20, 2011 at 09:21:20AM +0800, Dave Chinner wrote:
> On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote:
> > On Tue, Apr 19, 2011 at 05:57:40PM +0800, Jan Kara wrote:
> > > On Tue 19-04-11 17:35:23, Dave Chinner wrote:
> > > > On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote:
> > > > > A background flush work may run forever. So it's reasonable for it to
> > > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > > 
> > > > > The policy is
> > > > > - enqueue all newly expired inodes at each queue_io() time
> > > > > - enqueue all dirty inodes if there are no more expired inodes to sync
> > > > > 
> > > > > This will help reduce the number of dirty pages encountered by page
> > > > > reclaim, e.g. the pageout() calls. Normally older inodes contain older
> > > > > dirty pages, which are closer to the end of the LRU lists. So
> > > > > syncing older inodes first helps reduce the dirty pages reached by
> > > > > the page reclaim code.
> > > > 
> > > > Once again I think this is the wrong place to be changing writeback
> > > > policy decisions. for_background writeback only goes through
> > > > wb_writeback() and writeback_inodes_wb() (same as for_kupdate
> > > > writeback), so a decision to change from expired inodes to fresh
> > > > inodes, IMO, should be made in wb_writeback.
> > > > 
> > > > That is, for_background and for_kupdate writeback start with the
> > > > same policy (older_than_this set) to writeback expired inodes first,
> > > > then when background writeback runs out of expired inodes, it should
> > > > switch to all remaining inodes by clearing older_than_this instead
> > > > of refreshing it for the next loop.
> > >   Yes, I agree with this and my impression is that Fengguang is trying to
> > > achieve exactly this behavior.
> > > 
> > > > This keeps all the policy decisions in the one place, all using the
> > > > same (existing) mechanism, and all relatively simple to understand,
> > > > and easy to tracepoint for debugging.  Changing writeback policy
> > > > deep in the writeback stack is not a good idea as it will make
> > > > extending writeback policies in future (e.g. for cgroup awareness)
> > > > very messy.
> > >   Hmm, I see. I agree the policy decisions should be at one place if
> > > reasonably possible. Fengguang moves them from wb_writeback() to inode
> > > queueing code which looks like a logical place to me as well - there we
> > > have the largest control over what inodes do we decide to write and don't
> > > have to pass all the detailed 'instructions' down in wbc structure. So if
> > > we later want to add cgroup awareness to writeback, I imagine we just add
> > > the knowledge to inode queueing code.
> > 
> > I actually started with wb_writeback() as a natural choice, and then
> > found it much easier to do the expired-only=>all-inodes switching in
> > move_expired_inodes() since it needs to know the @b_dirty and @tmp
> > lists' emptiness to trigger the switch. It's not sane for
> > wb_writeback() to look into such details. And once you do the switch
> > part in move_expired_inodes(), the whole policy naturally follows.
> 
> Well, not really. You didn't need to modify move_expired_inodes() at
> all to implement these changes - all you needed to do was modify how
> older_than_this is configured.
> 
> Writeback policy is defined by the struct writeback_control.
> move_expired_inodes() is pure mechanism. What you've done is remove
> policy from the struct wbc and move it to move_expired_inodes(),
> which now defines both policy and mechanism.

> Further, this means that all the tracing that uses the struct wbc no
> longer shows the entire writeback policy being worked on, so we lose
> visibility into the policy decisions writeback is making.

Good point! I'm convinced: visibility is a necessity for debugging the
complex writeback behaviors.

> This same change is as simple as updating wbc->older_than_this
> appropriately after the wb_writeback() call for both background and
> kupdate and leaving the lower layers untouched. It's just a policy
> change. If you think the mechanism is inefficient, copy
> wbc->older_than_this to a local variable inside
> move_expired_inodes()....

Do you like something like this? (details will change a bit when
rearranging the patchset)

--- linux-next.orig/fs/fs-writeback.c	2011-04-20 10:30:47.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-20 10:40:19.000000000 +0800
@@ -660,11 +660,6 @@ static long wb_writeback(struct bdi_writ
 	long write_chunk;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -713,10 +708,17 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		if (work->for_kupdate || work->for_background) {
+			oldest_jif = jiffies -
+				msecs_to_jiffies(dirty_expire_interval * 10);
+			wbc.older_than_this = &oldest_jif;
+		}
+
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
+retry_all:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
 			__writeback_inodes_sb(work->sb, wb, &wbc);
@@ -733,6 +735,17 @@ static long wb_writeback(struct bdi_writ
 		if (wbc.nr_to_write <= 0)
 			continue;
 		/*
+		 * No expired inode? Try all fresh ones
+		 */
+		if ((work->for_kupdate || work->for_background) &&
+		    wbc.older_than_this &&
+		    wbc.nr_to_write == write_chunk &&
+		    list_empty(&wb->b_io) &&
+		    list_empty(&wb->b_more_io)) {
+			wbc.older_than_this = NULL;
+			goto retry_all;
+		}
+		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
 		if (!wbc.more_io)
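
The intended behavior: both kupdate and background work start every
chunk with older_than_this set, and only when a full pass wrote
nothing (wbc.nr_to_write == write_chunk) with both b_io and b_more_io
drained do we clear older_than_this and retry once with all dirty
inodes.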

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
@ 2011-04-20  2:53             ` Wu Fengguang
  0 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-20  2:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Wed, Apr 20, 2011 at 09:21:20AM +0800, Dave Chinner wrote:
> On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote:
> > On Tue, Apr 19, 2011 at 05:57:40PM +0800, Jan Kara wrote:
> > > On Tue 19-04-11 17:35:23, Dave Chinner wrote:
> > > > On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote:
> > > > > A background flush work may run for ever. So it's reasonable for it to
> > > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > > 
> > > > > The policy is
> > > > > - enqueue all newly expired inodes at each queue_io() time
> > > > > - enqueue all dirty inodes if there are no more expired inodes to sync
> > > > > 
> > > > > This will help reduce the number of dirty pages encountered by page
> > > > > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > > > > dirty pages, which are more close to the end of the LRU lists. So
> > > > > syncing older inodes first helps reducing the dirty pages reached by
> > > > > the page reclaim code.
> > > > 
> > > > Once again I think this is the wrong place to be changing writeback
> > > > policy decisions. for_background writeback only goes through
> > > > wb_writeback() and writeback_inodes_wb() (same as for_kupdate
> > > > writeback), so a decision to change from expired inodes to fresh
> > > > inodes, IMO, should be made in wb_writeback.
> > > > 
> > > > That is, for_background and for_kupdate writeback start with the
> > > > same policy (older_than_this set) to writeback expired inodes first,
> > > > then when background writeback runs out of expired inodes, it should
> > > > switch to all remaining inodes by clearing older_than_this instead
> > > > of refreshing it for the next loop.
> > >   Yes, I agree with this and my impression is that Fengguang is trying to
> > > achieve exactly this behavior.
> > > 
> > > > This keeps all the policy decisions in the one place, all using the
> > > > same (existing) mechanism, and all relatively simple to understand,
> > > > and easy to tracepoint for debugging.  Changing writeback policy
> > > > deep in the writeback stack is not a good idea as it will make
> > > > extending writeback policies in future (e.g. for cgroup awareness)
> > > > very messy.
> > >   Hmm, I see. I agree the policy decisions should be at one place if
> > > reasonably possible. Fengguang moves them from wb_writeback() to inode
> > > queueing code which looks like a logical place to me as well - there we
> > > have the largest control over what inodes do we decide to write and don't
> > > have to pass all the detailed 'instructions' down in wbc structure. So if
> > > we later want to add cgroup awareness to writeback, I imagine we just add
> > > the knowledge to inode queueing code.
> > 
> > I actually started with wb_writeback() as a natural choice, and then
> > found it much easier to do the expired-only=>all-inodes switching in
> > move_expired_inodes() since it needs to know the @b_dirty and @tmp
> > lists' emptiness to trigger the switch. It's not sane for
> > wb_writeback() to look into such details. And once you do the switch
> > part in move_expired_inodes(), the whole policy naturally follows.
> 
> Well, not really. You didn't need to modify move_expired_inodes() at
> all to implement these changes - all you needed to do was modify how
> older_than_this is configured.
> 
> writeback policy is defined by the struct writeback_control.
> move_expired_inodes() is pure mechanism. What you've done is remove
> policy from the struct wbc and moved it to move_expired_inodes(),
> which now defines both policy and mechanism.

> Furhter, this means that all the tracing that uses the struct wbc no
> no longer shows the entire writeback policy that is being worked on,
> so we lose visibility into policy decisions that writeback is
> making.

Good point! I'm convinced, visibility is a necessity for debugging the
complex writeback behaviors.

> This same change is as simple as updating wbc->older_than_this
> appropriately after the wb_writeback() call for both background and
> kupdate and leaving the lower layers untouched. It's just a policy
> change. If you thinkthe mechanism is inefficient, copy
> wbc->older_than_this to a local variable inside
> move_expired_inodes()....

Do you like something like this? (details will change a bit when
rearranging the patchset)

--- linux-next.orig/fs/fs-writeback.c	2011-04-20 10:30:47.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-20 10:40:19.000000000 +0800
@@ -660,11 +660,6 @@ static long wb_writeback(struct bdi_writ
 	long write_chunk;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -713,10 +708,17 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		if (work->for_kupdate || work->for_background) {
+			oldest_jif = jiffies -
+				msecs_to_jiffies(dirty_expire_interval * 10);
+			wbc.older_than_this = &oldest_jif;
+		}
+
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
+retry_all:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
 			__writeback_inodes_sb(work->sb, wb, &wbc);
@@ -733,6 +735,17 @@ static long wb_writeback(struct bdi_writ
 		if (wbc.nr_to_write <= 0)
 			continue;
 		/*
+		 * No expired inode? Try all fresh ones
+		 */
+		if ((work->for_kupdate || work->for_background) &&
+		    wbc.older_than_this &&
+		    wbc.nr_to_write == write_chunk &&
+		    list_empty(&wb->b_io) &&
+		    list_empty(&wb->b_more_io)) {
+			wbc.older_than_this = NULL;
+			goto retry_all;
+		}
+		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
 		if (!wbc.more_io)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-20  1:21           ` Dave Chinner
@ 2011-04-20  7:38             ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-20  7:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

> > > > > @@ -585,7 +597,8 @@ void writeback_inodes_wb(struct bdi_writ
> > > > >  	if (!wbc->wb_start)
> > > > >  		wbc->wb_start = jiffies; /* livelock avoidance */
> > > > >  	spin_lock(&inode_wb_list_lock);
> > > > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > > > +
> > > > > +	if (list_empty(&wb->b_io))
> > > > >  		queue_io(wb, wbc);
> > > > >
> > > > >  	while (!list_empty(&wb->b_io)) {
> > > > > @@ -612,7 +625,7 @@ static void __writeback_inodes_sb(struct
> > > > >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > > > >
> > > > >  	spin_lock(&inode_wb_list_lock);
> > > > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > > > +	if (list_empty(&wb->b_io))
> > > > >  		queue_io(wb, wbc);
> > > > >  	writeback_sb_inodes(sb, wb, wbc, true);
> > > > >  	spin_unlock(&inode_wb_list_lock);
> > > >
> > > > That changes the order in which we queue inodes for writeback.
> > > > Instead of moving b_more_io inodes onto the b_io list and expiring
> > > > more aged inodes on every call, we only ever do it when the list
> > > > is empty. That is, it seems to me that this will tend to give
> > > > b_more_io inodes a smaller share of writeback because they are being
> > > > moved back to the b_io list less frequently when there are lots of
> > > > other inodes being dirtied. Have you tested the impact of this
> > > > change on mixed workload performance? Indeed, can you starve
> > > > writeback of a large file simply by creating lots of small files in
> > > > another thread?
> > >   Yeah, this change looks suspicious to me as well.
> >
> > The exact behaviors are indeed rather complex. I personally feel the
> > new "always refill iff empty" policy is more consistent, cleaner and
> > easier to understand.
>
> That may be so, but that doesn't make the change good from an IO
> perspective. You said you'd only done light testing, and that's not
> sufficient to gauge the impact of such a change.
>
> > It basically says: at each round started by a b_io refill, setup a
> > _fixed_ work set with all current expired (or all currently dirtied
> > inodes if non is expired) and walk through it. "Fixed" work set means
> > no new inodes will be added to the work set during the walk.  When a
> > complete walk is done, start over with a new set of inodes that are
> > eligible at the time.
>
> Yes, I know what it does - I can read the code. You haven't, however,
> answered why it is a good change from an IO perspective.
>
> > The figure on page 14 illustrates the "rounds" idea:
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/linux-writeback-queues.pdf
> >
> > This procedure provides fairness among the inodes and guarantees that
> > each inode is synced once and only once in each round. So it's free
> > from starvation.
>
> Perhaps you should add some of this commentary to the commit
> message? That talks about the VM and LRU writeback, but that has
> nothing to do with writeback fairness. The commit message or
> comments in the code need to explain why something is being
> changed....

OK, added to changelog.

> >
> > If you are worried about performance, here is a simple tar+dd benchmark.
> > Both commands are actually running faster with this patchset:
> .....
> > The base kernel is 2.6.39-rc3+ plus IO-less patchset plus large write
> > chunk size. The test box has 3G mem and runs XFS. Test script is:
>
> <sigh>
>
> The numbers are meaningless to me - you've got a large number of
> other changes that are affecting writeback behaviour, and that's
> especially important because, at minimum, the change in write chunk
> size will hide any differences in IO patterns that this change will

The previous benchmarks are still valuable and more future-proof,
assuming that we are going to do IO-less and larger-chunk writeback soon.

> make. Please test against a vanilla kernel if that is what you are
> aiming these patches for. If you aren't aiming for a vanilla kernel,
> please say so in the patch series header...

Here are the test results for the vanilla kernel. It again shows better
numbers for dd, tar and overall run time.

             2.6.39-rc3   2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043      252.367
stddev           24.381       12.530

tar elapsed      30.097       28.808
dd  elapsed      13.214       11.782

wfg /tmp% g cpu log-no-moving-expire-vanilla log-moving-expire-vanilla|g tar
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.59s user 4.00s system 47% cpu 35.221 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.62s user 4.19s system 51% cpu 32.358 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.43s user 4.11s system 51% cpu 32.356 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.28s user 4.09s system 60% cpu 26.914 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.25s user 4.12s system 59% cpu 27.345 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.55s user 4.21s system 63% cpu 26.347 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.39s user 3.97s system 44% cpu 36.360 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.44s user 3.88s system 58% cpu 28.046 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.40s user 4.09s system 56% cpu 29.000 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.50s user 3.95s system 60% cpu 27.020 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.44s user 4.03s system 56% cpu 28.939 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.63s user 4.06s system 56% cpu 29.488 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.43s user 3.95s system 51% cpu 31.666 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.46s user 3.99s system 63% cpu 25.768 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.14s user 4.26s system 54% cpu 29.838 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.43s user 4.09s system 63% cpu 25.855 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.61s user 4.36s system 57% cpu 29.588 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.36s user 4.13s system 63% cpu 25.816 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.49s user 3.94s system 55% cpu 29.499 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2  12.53s user 3.92s system 51% cpu 31.625 total
wfg /tmp% g cpu log-no-moving-expire-vanilla log-moving-expire-vanilla|g dd
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.34s system 9% cpu 14.084 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.27s system 8% cpu 14.240 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 9% cpu 13.437 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.21s system 9% cpu 12.783 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.23s system 9% cpu 12.614 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 9% cpu 12.733 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 10% cpu 12.438 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.21s system 9% cpu 12.356 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.21s system 8% cpu 14.724 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 9% cpu 12.734 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.57s system 13% cpu 12.002 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.30s system 9% cpu 14.049 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.36s system 11% cpu 12.031 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 10% cpu 11.679 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.26s system 11% cpu 11.276 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.25s system 10% cpu 11.501 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.20s system 10% cpu 11.344 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.24s system 10% cpu 11.345 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.27s system 11% cpu 11.280 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000  0.00s user 1.22s system 10% cpu 11.312 total
wfg /tmp% g elapsed log-no-moving-expire-vanilla log-moving-expire-vanilla
log-no-moving-expire-vanilla:elapsed: 317.59000000000196
log-no-moving-expire-vanilla:elapsed: 269.16999999999825
log-no-moving-expire-vanilla:elapsed: 271.61000000000058
log-no-moving-expire-vanilla:elapsed: 233.08000000000175
log-no-moving-expire-vanilla:elapsed: 238.20000000000073
log-no-moving-expire-vanilla:elapsed: 240.68999999999505
log-no-moving-expire-vanilla:elapsed: 257.43000000000029
log-no-moving-expire-vanilla:elapsed: 249.45000000000437
log-no-moving-expire-vanilla:elapsed: 251.55000000000291
log-no-moving-expire-vanilla:elapsed: 231.65999999999622
log-moving-expire-vanilla:elapsed: 270.54999999999927
log-moving-expire-vanilla:elapsed: 254.34000000000015
log-moving-expire-vanilla:elapsed: 248.61000000000058
log-moving-expire-vanilla:elapsed: 238.18000000000029
log-moving-expire-vanilla:elapsed: 263.5
log-moving-expire-vanilla:elapsed: 234.15999999999985
log-moving-expire-vanilla:elapsed: 266.81000000000131
log-moving-expire-vanilla:elapsed: 238.14999999999782
log-moving-expire-vanilla:elapsed: 263.14999999999782
log-moving-expire-vanilla:elapsed: 246.22000000000116

> Anyway, I'm going to put some numbers into a hypothetical steady
> state situation to demonstrate the differences in algorithms.
> Let's say we have lots of inodes with 100 dirty pages being created,
> and one large writeback going on. We expire 8 new inodes for every
> 1024 pages we write back.
>
> With the old code, we do:
>
> 	b_more_io (large inode) -> b_io (1l)
> 	8 newly expired inodes -> b_io (1l, 8s)
>
> 	writeback  large inode 1024 pages -> b_more_io
>
> 	b_more_io (large inode) -> b_io (8s, 1l)
> 	8 newly expired inodes -> b_io (8s, 1l, 8s)
>
> 	writeback  8 small inodes 800 pages
> 		   1 large inode 224 pages -> b_more_io
>
> 	b_more_io (large inode) -> b_io (8s, 1l)
> 	8 newly expired inodes -> b_io (8s, 1l, 8s)
> 	.....
>
> Your new code:
>
> 	b_more_io (large inode) -> b_io (1l)
> 	8 newly expired inodes -> b_io (1l, 8s)
>
> 	writeback  large inode 1024 pages -> b_more_io
> 	(b_io == 8s)
> 	writeback  8 small inodes 800 pages
>
> 	b_io empty: (1800 pages written)
> 		b_more_io (large inode) -> b_io (1l)
> 		14 newly expired inodes -> b_io (1l, 14s)
>
> 	writeback  large inode 1024 pages -> b_more_io
> 	(b_io == 14s)
> 	writeback  10 small inodes 1000 pages
> 		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
> 	writeback  5 small inodes 500 pages
> 	b_io empty: (2548 pages written)
> 		b_more_io (large inode) -> b_io (1l, 1s(24))
> 		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
> 	......
>
> Rough progression of pages written at b_io refill:
>
> Old code:
>
> 	total	large file	% of writeback
> 	1024	224		21.9% (fixed)
>
> New code:
> 	total	large file	% of writeback
> 	1800	1024		~55%
> 	2550	1024		~40%
> 	3050	1024		~33%
> 	3500	1024		~29%
> 	3950	1024		~26%
> 	4250	1024		~24%
> 	4500	1024		~22.7%
> 	4700	1024		~21.7%
> 	4800	1024		~21.3%
> 	4800	1024		~21.3%
> 	(pretty much steady state from here)
>
> Ok, so the steady state is reached with a similar percentage of
> writeback to the large file as the existing code. Ok, that's good,
> but providing some evidence that it doesn't change the share of
> writeback to the large file should be in the commit message ;)
>
> The other advantage to this is that we always write 1024 page chunks
> to the large file, rather than smaller "whatever remains" chunks. I
> think this will have a bigger effect on a vanilla kernel than on the
> kernel you tested on above because of the smaller writeback chunk
> size.

Good analysis! I've included it in the changelog :)
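
To make the convergence concrete, here is a toy userspace model of the
steady state you describe (one large file, 100-page small inodes, 8
expiring per 1024 pages written; it ignores the per-inode chunking
details, so the numbers only approximate your table):

	#include <stdio.h>

	int main(void)
	{
		double small = 8;	/* small inodes queued at the first refill */
		int round;

		for (round = 1; round <= 12; round++) {
			/* one 1024-page chunk of the large file + all small inodes */
			double total = 1024 + 100 * small;

			printf("round %2d: %5.0f pages, large file %4.1f%%\n",
			       round, total, 100 * 1024 / total);
			/* 8 more inodes expire per 1024 pages written this round */
			small = 8 * total / 1024;
		}
		return 0;
	}

It converges to ~4680 pages per round with the large file getting ~21.9%
of the writeback, roughly matching both the tail of your table and the
old code's fixed share.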

> I'm convinced that refilling only when the queue is empty is a
> sane change now. You need to separate this from the
> move_expired_inodes() changes because it is doing something very
> different to writeback.

OK. It actually depends on the patch "writeback: try more writeback as long as
something was written". So I'll include it as the last one in the next post.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19 21:10         ` Jan Kara
@ 2011-04-20  7:50           ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-20  7:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Mel Gorman, Dave Chinner, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Wed, Apr 20, 2011 at 05:10:08AM +0800, Jan Kara wrote:
> On Tue 19-04-11 19:16:01, Wu Fengguang wrote:
> > On Tue, Apr 19, 2011 at 06:20:16PM +0800, Jan Kara wrote:
> > > On Tue 19-04-11 11:00:08, Wu Fengguang wrote:
> > > > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > > > they only populate possibly a subset of eligible inodes into b_io at
> > > > entrance time. When the queued set of inodes is all synced, they just
> > > > return, possibly with all queued inode pages written but still
> > > > wbc.nr_to_write > 0.
> > > > 
> > > > For kupdate and background writeback, there may be more eligible inodes
> > > > sitting in b_dirty when the current set of b_io inodes are completed. So
> > > > it is necessary to try another round of writeback as long as we made some
> > > > progress in this round. When there are no more eligible inodes, no more
> > > > inodes will be enqueued in queue_io(), hence nothing could/will be
> > > > synced and we may safely bail.
> > >   Let me understand your concern here: You are afraid that if we do
> > > for_background or for_kupdate writeback and we write less than
> > > MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> > > inodes to write at the time we are stopping writeback - the two realistic
> > 
> > Yes.
> > 
> > > cases I can think of are:
> > > a) when inodes just freshly expired during writeback
> > > b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
> > >   background threshold due to data on some other bdi. And then while we are
> > >   doing writeback someone does dirtying at our bdi.
> > > Or do you see some other case as well?
> > > 
> > > The a) case does not seem like a big issue to me after your changes to
> > 
> > Yeah (a) is not an issue with kupdate writeback.
> > 
> > > move_expired_inodes(). The b) case maybe but do you think it will make any
> > > difference? 
> > 
> > (b) also seems weird. What I have in mind is this for_background case.
> > Imagine 100 inodes
> > 
> >         i0, i1, i2, ..., i90, i91, ..., i99
> > 
> > At queue_io() time, i90-i99 happen to be expired and moved to b_io for
> > IO. When finished successfully, if their total size is less than
> > MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> > quit the background work (w/o this patch) while it's still over
> > background threshold.
> > 
> > This will be a fairly normal/frequent case I guess.
>   Ah OK, I see. I missed this case your patch set has added. Also your
> changes of
>         if (!wbc->for_kupdate || list_empty(&wb->b_io))
> to
> 	if (list_empty(&wb->b_io))
> are going to cause more cases where we'd hit nr_to_write > 0 (e.g. when one
> pass of b_io does not write all the inodes, so some are left in the b_io list
> and then the next call to writeback finds these inodes there but there's less
> than MAX_WRITEBACK_PAGES in them).

Yes. It's exactly the more aggressive retry logic in wb_writeback()
that allows me to comfortably kill that !wbc->for_kupdate test :)

> Frankly, it makes me like the above change even less. I'd rather see
> writeback_inodes_wb / __writeback_inodes_sb always work on a fresh
> set of inodes which is initialized whenever we enter these
> functions. It just seems less surprising to me...

The old aggressive enqueue policy is an ad-hoc workaround to prevent the
background work from missing some inodes and quitting early. Now that we
have the complete solution, why not kill it for more consistent code and
behavior? And we get better performance numbers :)
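
For reference, the retry logic relied on here has roughly this shape in
wb_writeback() (a sketch of the idea in patches 4/6 and 5/6, not the
exact code):

	for (;;) {
		...
		wbc.nr_to_write = write_chunk;
		wbc.inodes_cleaned = 0;		/* added by patch 4/6 */

		/* writeback_inodes_wb() / __writeback_inodes_sb() here */

		if (wbc.nr_to_write <= 0)
			continue;	/* wrote a full chunk: surely more to do */
		if (wbc.inodes_cleaned)
			continue;	/* made progress: refill b_io and retry */
		if (!wbc.more_io)
			break;		/* nothing written, no more IO: done */
		...
	}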

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-20  7:50           ` Wu Fengguang
@ 2011-04-20 15:22             ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-20 15:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Wed 20-04-11 15:50:53, Wu Fengguang wrote:
> > > >   Let me understand your concern here: You are afraid that if we do
> > > > for_background or for_kupdate writeback and we write less than
> > > > MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> > > > inodes to write at the time we are stopping writeback - the two realistic
> > > 
> > > Yes.
> > > 
> > > > cases I can think of are:
> > > > a) when inodes just freshly expired during writeback
> > > > b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
> > > >   background threshold due to data on some other bdi. And then while we are
> > > >   doing writeback someone does dirtying at our bdi.
> > > > Or do you see some other case as well?
> > > > 
> > > > The a) case does not seem like a big issue to me after your changes to
> > > 
> > > Yeah (a) is not an issue with kupdate writeback.
> > > 
> > > > move_expired_inodes(). The b) case maybe but do you think it will make any
> > > > difference? 
> > > 
> > > (b) also seems weird. What I have in mind is this for_background case.
> > > Imagine 100 inodes
> > > 
> > >         i0, i1, i2, ..., i90, i91, ..., i99
> > > 
> > > At queue_io() time, i90-i99 happen to be expired and moved to b_io for
> > > IO. When finished successfully, if their total size is less than
> > > MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> > > quit the background work (w/o this patch) while it's still over
> > > background threshold.
> > > 
> > > This will be a fairly normal/frequent case I guess.
> >   Ah OK, I see. I missed this case your patch set has added. Also your
> > changes of
> >         if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > to
> > 	if (list_empty(&wb->b_io))
> > are going to cause more cases where we'd hit nr_to_write > 0 (e.g. when one
> > pass of b_io does not write all the inodes, so some are left in the b_io list
> > and then the next call to writeback finds these inodes there but there's less
> > than MAX_WRITEBACK_PAGES in them).
> 
> Yes. It's exactly the more aggressive retry logic in wb_writeback()
> that allows me to comfortably kill that !wbc->for_kupdate test :)
> 
> > Frankly, it makes me like the above change even less. I'd rather see
> > writeback_inodes_wb / __writeback_inodes_sb always work on a fresh
> > set of inodes which is initialized whenever we enter these
> > functions. It just seems less surprising to me...
> 
> The old aggressive enqueue policy is an ad-hoc workaround to prevent the
> background work from missing some inodes and quitting early. Now that we
> have the complete solution, why not kill it for more consistent code and
> behavior? And we get better performance numbers :)
  BTW, have you understood why you get better numbers? What are we doing
better with this changed logic?

I've thought about it and also about Dave's analysis. Now I think it's OK to
not add new inodes to b_io when it's not empty. But what I still don't like
is that the emptiness / non-emptiness of b_io carries hidden internal
state - callers of writeback_inodes_wb() shouldn't have to know or care
about such subtleties (__writeback_inodes_sb() is an internal function so I
don't care about that one too much).

So I'd prefer writeback_inodes_wb() (and also __writeback_inodes_sb() but
that's not too important) to do something like:
	int requeued = 0;
requeue:
	if (list_empty(&wb->b_io)) {
		queue_io(wb, wbc->older_than_this);
		requeued = 1;
	}
	while (!list_empty(&wb->b_io)) {
		... do stuff ...
	}
	if (wbc->nr_to_write > 0 && !requeued)
		goto requeue;

Because if you don't do this, you have to make a similar change in all the
callers of writeback_inodes_wb() (OK, there are just three, but still).
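
Folded into writeback_inodes_wb() itself, that would come out roughly as
below (an untested sketch against the function as quoted earlier in the
thread):

	void writeback_inodes_wb(struct bdi_writeback *wb,
				 struct writeback_control *wbc)
	{
		int requeued = 0;

		if (!wbc->wb_start)
			wbc->wb_start = jiffies; /* livelock avoidance */
		spin_lock(&inode_wb_list_lock);
	requeue:
		if (list_empty(&wb->b_io)) {
			queue_io(wb, wbc);
			requeued = 1;
		}
		while (!list_empty(&wb->b_io)) {
			/* ... write back inodes as the function does today ... */
		}
		/* started with leftover b_io and still have quota: refill once */
		if (wbc->nr_to_write > 0 && !requeued)
			goto requeue;
		spin_unlock(&inode_wb_list_lock);
	}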

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-20  2:53             ` Wu Fengguang
@ 2011-04-21  0:45               ` Dave Chinner
  -1 siblings, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-21  0:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Wed, Apr 20, 2011 at 10:53:21AM +0800, Wu Fengguang wrote:
> On Wed, Apr 20, 2011 at 09:21:20AM +0800, Dave Chinner wrote:
> > On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote:
> > > I actually started with wb_writeback() as a natural choice, and then
> > > found it much easier to do the expired-only=>all-inodes switching in
> > > move_expired_inodes() since it needs to know the @b_dirty and @tmp
> > > lists' emptiness to trigger the switch. It's not sane for
> > > wb_writeback() to look into such details. And once you do the switch
> > > part in move_expired_inodes(), the whole policy naturally follows.
> > 
> > Well, not really. You didn't need to modify move_expired_inodes() at
> > all to implement these changes - all you needed to do was modify how
> > older_than_this is configured.
> > 
> > writeback policy is defined by the struct writeback_control.
> > move_expired_inodes() is pure mechanism. What you've done is remove
> > policy from the struct wbc and moved it to move_expired_inodes(),
> > which now defines both policy and mechanism.
> 
> > Further, this means that all the tracing that uses the struct wbc
> > no longer shows the entire writeback policy that is being worked on,
> > so we lose visibility into the policy decisions that writeback is
> > making.
> 
> Good point! I'm convinced - visibility is a necessity for debugging
> these complex writeback behaviors.
> 
> > This same change is as simple as updating wbc->older_than_this
> > appropriately after the wb_writeback() call for both background and
> > kupdate and leaving the lower layers untouched. It's just a policy
> > change. If you think the mechanism is inefficient, copy
> > wbc->older_than_this to a local variable inside
> > move_expired_inodes()....
> 
> Do you like something like this? (details will change a bit when
> rearranging the patchset)

Yeah, this is close to what I had in mind.

> 
> --- linux-next.orig/fs/fs-writeback.c	2011-04-20 10:30:47.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-04-20 10:40:19.000000000 +0800
> @@ -660,11 +660,6 @@ static long wb_writeback(struct bdi_writ
>  	long write_chunk;
>  	struct inode *inode;
>  
> -	if (wbc.for_kupdate) {
> -		wbc.older_than_this = &oldest_jif;
> -		oldest_jif = jiffies -
> -				msecs_to_jiffies(dirty_expire_interval * 10);
> -	}

Right here I'd do:

	if (work->for_kupdate || work->for_background)
		wbc.older_than_this = &oldest_jif;

so that the setting of wbc.older_than_this in the loop can trigger
on whether it is null or not.

>  	if (!wbc.range_cyclic) {
>  		wbc.range_start = 0;
>  		wbc.range_end = LLONG_MAX;
> @@ -713,10 +708,17 @@ static long wb_writeback(struct bdi_writ
>  		if (work->for_background && !over_bground_thresh())
>  			break;
>  
> +		if (work->for_kupdate || work->for_background) {
> +			oldest_jif = jiffies -
> +				msecs_to_jiffies(dirty_expire_interval * 10);
> +			wbc.older_than_this = &oldest_jif;
> +		}
> +

if you change that to:

		if (wbc.older_than_this) {
			*wbc.older_than_this = jiffies -
				msecs_to_jiffies(dirty_expire_interval * 10);
		}

>  		wbc.more_io = 0;
>  		wbc.nr_to_write = write_chunk;
>  		wbc.pages_skipped = 0;
>  
> +retry_all:

You can get rid of this retry_all label and have the changeover in
behaviour re-initialise nr_to_write, etc.

>  		trace_wbc_writeback_start(&wbc, wb->bdi);
>  		if (work->sb)
>  			__writeback_inodes_sb(work->sb, wb, &wbc);
> @@ -733,6 +735,17 @@ static long wb_writeback(struct bdi_writ
>  		if (wbc.nr_to_write <= 0)
>  			continue;
>  		/*
> +		 * No expired inode? Try all fresh ones
> +		 */
> +		if ((work->for_kupdate || work->for_background) &&
> +		    wbc.older_than_this &&
> +		    wbc.nr_to_write == write_chunk &&
> +		    list_empty(&wb->b_io) &&
> +		    list_empty(&wb->b_more_io)) {
> +			wbc.older_than_this = NULL;
> +			goto retry_all;
> +		}

And here, only do this for work->for_background, as kupdate writeback
stops when we run out of expired inodes (i.e. it doesn't write back
non-expired inodes).
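
Putting those pieces together, the loop would end up roughly like this
(untested sketch):

	if (work->for_kupdate || work->for_background)
		wbc.older_than_this = &oldest_jif;

	for (;;) {
		...
		if (wbc.older_than_this)
			*wbc.older_than_this = jiffies -
				msecs_to_jiffies(dirty_expire_interval * 10);

		wbc.more_io = 0;
		wbc.nr_to_write = write_chunk;
		wbc.pages_skipped = 0;

		/* __writeback_inodes_sb() / writeback_inodes_wb() here */

		/*
		 * Background only: no expired inode was found at all, so
		 * drop the expire test and take fresh inodes.  Going back
		 * to the top of the loop re-initialises nr_to_write etc.,
		 * which replaces the retry_all label.
		 */
		if (work->for_background && wbc.older_than_this &&
		    wbc.nr_to_write == write_chunk &&
		    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
			wbc.older_than_this = NULL;
			continue;
		}
		...
	}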

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-20  7:38             ` Wu Fengguang
@ 2011-04-21  1:01               ` Dave Chinner
  -1 siblings, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-21  1:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Wed, Apr 20, 2011 at 03:38:22PM +0800, Wu Fengguang wrote:
> > make. Please test against a vanilla kernel if that is what you are
> > aiming these patches for. If you aren't aiming for a vanilla kernel,
> > please say so in the patch series header...
> 
> Here are the test results for the vanilla kernel. It again shows better
> numbers for dd, tar and overall run time.
> 
>              2.6.39-rc3   2.6.39-rc3-dyn-expire+
> ------------------------------------------------
> all elapsed     256.043      252.367
> stddev           24.381       12.530
> 
> tar elapsed      30.097       28.808
> dd  elapsed      13.214       11.782

The big reduction in run-to-run variance is very convincing - more so
than the reduction in runtime. That's kind of what I had hoped would
occur once I understood the implications of the change. Thanks for
running the test to close the loop. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21  1:01               ` Dave Chinner
@ 2011-04-21  1:47                 ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  1:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Thu, Apr 21, 2011 at 09:01:32AM +0800, Dave Chinner wrote:
> On Wed, Apr 20, 2011 at 03:38:22PM +0800, Wu Fengguang wrote:
> > > make. Please test against a vanilla kernel if that is what you are
> > > aiming these patches for. If you aren't aiming for a vanilla kernel,
> > > please say so in the patch series header...
> > 
> > Here are the test results for the vanilla kernel. It again shows better
> > numbers for dd, tar and overall run time.
> > 
> >              2.6.39-rc3   2.6.39-rc3-dyn-expire+
> > ------------------------------------------------
> > all elapsed     256.043      252.367
> > stddev           24.381       12.530
> > 
> > tar elapsed      30.097       28.808
> > dd  elapsed      13.214       11.782
> 
> The big reduction in run-to-run variance is very convincing - more so
> than the reduction in runtime - that's kind of what I had hoped
> would occur once I understood the implications of the change. Thanks
> for running the test to close the loop. :)

And you can see how the user-perceivable variations in elapsed time
are reduced by the patchsets:

vanilla 
             user       system     %cpu       elapsed
stddev       0.000      0.037      0.539      0.805     dd,  xfs
stddev       0.117      0.102      5.974      3.498     tar, xfs

moving-target
stddev       0.000      0.102      1.025      0.803     dd,  xfs
stddev       0.131      0.136      4.415      2.136     tar, xfs

IO-less + moving-target 
stddev       0.000      0.022      0.000      0.283     dd,  xfs
stddev       0.000      0.031      0.000      0.151     dd,  ext4
stddev       0.111      0.218      2.040      0.532     tar, xfs
stddev       0.129      0.119      1.020      0.215     tar, ext4

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21  0:45               ` Dave Chinner
@ 2011-04-21  2:06                 ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  2:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Thu, Apr 21, 2011 at 08:45:47AM +0800, Dave Chinner wrote:
> On Wed, Apr 20, 2011 at 10:53:21AM +0800, Wu Fengguang wrote:
> > On Wed, Apr 20, 2011 at 09:21:20AM +0800, Dave Chinner wrote:
> > > On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote:
> > > > I actually started with wb_writeback() as a natural choice, and then
> > > > found it much easier to do the expired-only=>all-inodes switching in
> > > > move_expired_inodes() since it needs to know the @b_dirty and @tmp
> > > > lists' emptiness to trigger the switch. It's not sane for
> > > > wb_writeback() to look into such details. And once you do the switch
> > > > part in move_expired_inodes(), the whole policy naturally follows.
> > > 
> > > Well, not really. You didn't need to modify move_expired_inodes() at
> > > all to implement these changes - all you needed to do was modify how
> > > older_than_this is configured.
> > > 
> > > writeback policy is defined by the struct writeback_control.
> > > move_expired_inodes() is pure mechanism. What you've done is remove
> > > policy from the struct wbc and moved it to move_expired_inodes(),
> > > which now defines both policy and mechanism.
> > 
> > > Further, this means that all the tracing that uses the struct wbc
> > > no longer shows the entire writeback policy that is being worked on,
> > > so we lose visibility into policy decisions that writeback is
> > > making.
> > 
> > Good point! I'm convinced, visibility is a necessity for debugging the
> > complex writeback behaviors.
> > 
> > > This same change is as simple as updating wbc->older_than_this
> > > appropriately after the wb_writeback() call for both background and
> > > kupdate and leaving the lower layers untouched. It's just a policy
> > > change. If you think the mechanism is inefficient, copy
> > > wbc->older_than_this to a local variable inside
> > > move_expired_inodes()....
> > 
> > Do you like something like this? (details will change a bit when
> > rearranging the patchset)
> 
> Yeah, this is close to what I had in mind.
> 
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2011-04-20 10:30:47.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2011-04-20 10:40:19.000000000 +0800
> > @@ -660,11 +660,6 @@ static long wb_writeback(struct bdi_writ
> >  	long write_chunk;
> >  	struct inode *inode;
> >  
> > -	if (wbc.for_kupdate) {
> > -		wbc.older_than_this = &oldest_jif;
> > -		oldest_jif = jiffies -
> > -				msecs_to_jiffies(dirty_expire_interval * 10);
> > -	}
> 
> Right here I'd do:
> 
> 	if (work->for_kupdate || work->for_background)
> 		wbc.older_than_this = &oldest_jif;
> 
> so that the setting of wbc.older_than_this in the loop can trigger
> on whether it is null or not.

That's the tricky part that drove me to change move_expired_inodes()
directly..

One important thing to bear in mind is that the background work can
run on for one hour, one day or whatever. During that time dirty
inodes come and go, expire and get cleaned.  If we only reset
wbc.older_than_this and never restore it _inside_ the loop, we'll
quickly lose the ability to "start with expired inodes", e.g. after
just 5 minutes.

So we need to start by searching for expired inodes at each
queue_io() time: wbc.older_than_this shall be properly restored to
&oldest_jif inside the loop, since finding no expired inodes in this
loop does not mean no new inodes will have expired by the next loop.

> >  	if (!wbc.range_cyclic) {
> >  		wbc.range_start = 0;
> >  		wbc.range_end = LLONG_MAX;
> > @@ -713,10 +708,17 @@ static long wb_writeback(struct bdi_writ
> >  		if (work->for_background && !over_bground_thresh())
> >  			break;
> >  
> > +		if (work->for_kupdate || work->for_background) {
> > +			oldest_jif = jiffies -
> > +				msecs_to_jiffies(dirty_expire_interval * 10);
> > +			wbc.older_than_this = &oldest_jif;
> > +		}
> > +
> 
> if you change that to:
> 
> 		if (wbc.older_than_this) {
> 			*wbc.older_than_this = jiffies -
> 				msecs_to_jiffies(dirty_expire_interval * 10);
> 		}
> 
> >  		wbc.more_io = 0;
> >  		wbc.nr_to_write = write_chunk;
> >  		wbc.pages_skipped = 0;
> >  
> > +retry_all:
> 
> You can get rid of this retry_all label and have the changeover in
> behaviour re-initialise nr_to_write, etc.
> 
> >  		trace_wbc_writeback_start(&wbc, wb->bdi);
> >  		if (work->sb)
> >  			__writeback_inodes_sb(work->sb, wb, &wbc);
> > @@ -733,6 +735,17 @@ static long wb_writeback(struct bdi_writ
> >  		if (wbc.nr_to_write <= 0)
> >  			continue;
> >  		/*
> > +		 * No expired inode? Try all fresh ones
> > +		 */
> > +		if ((work->for_kupdate || work->for_background) &&
> > +		    wbc.older_than_this &&
> > +		    wbc.nr_to_write == write_chunk &&
> > +		    list_empty(&wb->b_io) &&
> > +		    list_empty(&wb->b_more_io)) {
> > +			wbc.older_than_this = NULL;
> > +			goto retry_all;
> > +		}
> 
> And here only do this for work->for_background as kupdate writeback
> stops when we run out of expired inodes (i.e. it doesn't writeback
> non-expired inodes).

Sorry for the mistake. I've fixed it in the v2 :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21  2:06                 ` Wu Fengguang
@ 2011-04-21  3:01                   ` Dave Chinner
  -1 siblings, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-21  3:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Thu, Apr 21, 2011 at 10:06:17AM +0800, Wu Fengguang wrote:
> On Thu, Apr 21, 2011 at 08:45:47AM +0800, Dave Chinner wrote:
> > On Wed, Apr 20, 2011 at 10:53:21AM +0800, Wu Fengguang wrote:
> > > On Wed, Apr 20, 2011 at 09:21:20AM +0800, Dave Chinner wrote:
> > > > On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote:
> > > > > I actually started with wb_writeback() as a natural choice, and then
> > > > > found it much easier to do the expired-only=>all-inodes switching in
> > > > > move_expired_inodes() since it needs to know the @b_dirty and @tmp
> > > > > lists' emptiness to trigger the switch. It's not sane for
> > > > > wb_writeback() to look into such details. And once you do the switch
> > > > > part in move_expired_inodes(), the whole policy naturally follows.
> > > > 
> > > > Well, not really. You didn't need to modify move_expired_inodes() at
> > > > all to implement these changes - all you needed to do was modify how
> > > > older_than_this is configured.
> > > > 
> > > > writeback policy is defined by the struct writeback_control.
> > > > move_expired_inodes() is pure mechanism. What you've done is remove
> > > > policy from the struct wbc and moved it to move_expired_inodes(),
> > > > which now defines both policy and mechanism.
> > > 
> > > > Further, this means that all the tracing that uses the struct wbc
> > > > no longer shows the entire writeback policy that is being worked on,
> > > > so we lose visibility into policy decisions that writeback is
> > > > making.
> > > 
> > > Good point! I'm convinced, visibility is a necessity for debugging the
> > > complex writeback behaviors.
> > > 
> > > > This same change is as simple as updating wbc->older_than_this
> > > > appropriately after the wb_writeback() call for both background and
> > > > kupdate and leaving the lower layers untouched. It's just a policy
> > > > change. If you think the mechanism is inefficient, copy
> > > > wbc->older_than_this to a local variable inside
> > > > move_expired_inodes()....
> > > 
> > > Do you like something like this? (details will change a bit when
> > > rearranging the patchset)
> > 
> > Yeah, this is close to what I had in mind.
> > 
> > > 
> > > --- linux-next.orig/fs/fs-writeback.c	2011-04-20 10:30:47.000000000 +0800
> > > +++ linux-next/fs/fs-writeback.c	2011-04-20 10:40:19.000000000 +0800
> > > @@ -660,11 +660,6 @@ static long wb_writeback(struct bdi_writ
> > >  	long write_chunk;
> > >  	struct inode *inode;
> > >  
> > > -	if (wbc.for_kupdate) {
> > > -		wbc.older_than_this = &oldest_jif;
> > > -		oldest_jif = jiffies -
> > > -				msecs_to_jiffies(dirty_expire_interval * 10);
> > > -	}
> > 
> > Right here I'd do:
> > 
> > 	if (work->for_kupdate || work->for_background)
> > 		wbc.older_than_this = &oldest_jif;
> > 
> > so that the setting of wbc.older_than_this in the loop can trigger
> > on whether it is null or not.
> 
> That's the tricky part that drove me to change move_expired_inodes()
> directly..
> 
> One important thing to bear in mind is that the background work can
> run on for one hour, one day or whatever. During that time dirty
> inodes come and go, expire and get cleaned.  If we only reset
> wbc.older_than_this and never restore it _inside_ the loop, we'll
> quickly lose the ability to "start with expired inodes", e.g. after
> just 5 minutes.

However, there's no need to implicitly switch back to expired inodes
on the next wb_writeback loop - it only needs to switch back when
b_io is emptied. And I suspect that it really only needs to switch
if there are inodes on b_more_io, because if we didn't put any inodes
onto b_more_io, then we most likely cleaned the entire list of
unexpired inodes in a single write chunk...

That is, something like this when updating the background state in
the loop tail:

	if (work->for_background && list_empty(&wb->b_io)) {
		if (wbc.older_than_this) {
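			/* expired-first pass drained b_io: if b_more_io is
			 * also empty, widen the policy to all dirty inodes */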
			if (list_empty(&wb->b_more_io)) {
				wbc.older_than_this = NULL;
				continue;
			}
		} else if (!list_empty(&wb->b_more_io)) {
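			/* the all-inodes pass left inodes on b_more_io:
			 * narrow back to expired-first for the next batch */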
			wbc.older_than_this = &oldest_jif;
			continue;
		}
	}

Still, given wb_writeback() is the only caller of both
__writeback_inodes_sb and writeback_inodes_wb(), I'm wondering if
moving the queue_io calls up into wb_writeback() would clean up this
logic somewhat. I think Jan mentioned doing something like this as
well elsewhere in the thread...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-20 15:22             ` Jan Kara
@ 2011-04-21  3:33             ` Wu Fengguang
  2011-04-21  4:39                 ` Christoph Hellwig
  2011-04-21  7:09                 ` Dave Chinner
  -1 siblings, 2 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  3:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Mel Gorman, Dave Chinner, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

[-- Attachment #1: Type: text/plain, Size: 7882 bytes --]

On Wed, Apr 20, 2011 at 11:22:11PM +0800, Jan Kara wrote:
> On Wed 20-04-11 15:50:53, Wu Fengguang wrote:
> > > > >   Let me understand your concern here: You are afraid that if we do
> > > > > for_background or for_kupdate writeback and we write less than
> > > > > MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> > > > > inodes to write at the time we are stopping writeback - the two realistic
> > > > 
> > > > Yes.
> > > > 
> > > > > cases I can think of are:
> > > > > a) when inodes just freshly expired during writeback
> > > > > b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
> > > > >   background threshold due to data on some other bdi. And then while we are
> > > > >   doing writeback someone does dirtying at our bdi.
> > > > > Or do you see some other case as well?
> > > > > 
> > > > > The a) case does not seem like a big issue to me after your changes to
> > > > 
> > > > Yeah (a) is not an issue with kupdate writeback.
> > > > 
> > > > > move_expired_inodes(). The b) case maybe but do you think it will make any
> > > > > difference? 
> > > > 
> > > > (b) seems also weird. What in my mind is this for_background case.
> > > > Imagine 100 inodes
> > > > 
> > > >         i0, i1, i2, ..., i90, i91, i99
> > > > 
> > > > At queue_io() time, i90-i99 happen to be expired and moved to s_io for
> > > > IO. When finished successfully, if their total size is less than
> > > > MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> > > > quit the background work (w/o this patch) while it's still over
> > > > background threshold.
> > > > 
> > > > This will be a fairly normal/frequent case I guess.
> > >   Ah OK, I see. I missed this case your patch set has added. Also your
> > > changes of
> > >         if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > to
> > > 	if (list_empty(&wb->b_io))
> > > are going to cause more cases when we'd hit nr_to_write > 0 (e.g. when one
> > > pass of b_io does not write all the inodes so some are left in b_io list
> > > and then next call to writeback finds these inodes there but there's less
> > > than MAX_WRITEBACK_PAGES in them).
> > 
> > Yes. It's exactly the more aggressive retry logic in wb_writeback()
> > that allows me to comfortably kill that !wbc->for_kupdate test :)
> > 
> > > Frankly, it makes me like the above change even less. I'd rather see
> > > writeback_inodes_wb / __writeback_inodes_sb always work on a fresh
> > > set of inodes which is initialized whenever we enter these
> > > functions. It just seems less surprising to me...
> > 
> > The old aggressive enqueue policy is an ad-hoc workaround to prevent
> > background work to miss some inodes and quit early. Now that we have
> > the complete solution, why not killing it for more consistent code and
> > behavior? And get better performance numbers :)
>   BTW, have you understood why you get better numbers? What are we doing
> better with this changed logic?

Good question. I'm also puzzled to find that it runs consistently
better on 4MB, 32MB and 128MB write chunk sizes, with/without the
IO-less and larger chunk size patches.

It's not about pageout(), because I see "nr_vmscan_write 0" in
/proc/vmstat in the tests.

It's not about the full vs. remaining chunk size -- it may have helped
the vanilla kernel, but "writeback: make nr_to_write a per-file limit",
part of the large chunk size patches, already guarantees each file
will get the full chunk size.

I collected the writeback_single_inode() traces (patch attached for
your reference) for several test runs each, and found many more
I_DIRTY_PAGES entries after the patchset. Dave, do you know why so
many I_DIRTY_PAGES flags (or radix tree tags) remain after the XFS
->writepages() call, even for small files?

wfg /tmp% g -c I_DIRTY_PAGES trace-*
trace-moving-expire-1:28213
trace-no-moving-expire:6684

wfg /tmp% g -c I_DIRTY_DATASYNC trace-*
trace-moving-expire-1:179
trace-no-moving-expire:193

wfg /tmp% g -c I_DIRTY_SYNC trace-* 
trace-moving-expire-1:29394
trace-no-moving-expire:31593

wfg /tmp% wc -l trace-*
   81108 trace-moving-expire-1
   68562 trace-no-moving-expire

wfg /tmp% head trace-*
==> trace-moving-expire-1 <==
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-2982  [000]   633.671746: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1177 wrote=1025 to_write=-1 index=21525
           <...>-2982  [000]   633.672704: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1178 wrote=1025 to_write=-1 index=22550
           <...>-2982  [000]   633.673638: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1179 wrote=1025 to_write=-1 index=23575
           <...>-2982  [000]   633.674573: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1180 wrote=1025 to_write=-1 index=24600
           <...>-2982  [000]   633.880621: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1387 wrote=1025 to_write=-1 index=25625
           <...>-2982  [000]   633.881345: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1388 wrote=1025 to_write=-1 index=26650

==> trace-no-moving-expire <==
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-2233  [006]   311.175491: writeback_single_inode: bdi 0:15: ino=1574019 state=I_DIRTY_DATASYNC|I_REFERENCED age=0 wrote=0 to_write=1024 index=0
           <...>-2233  [006]   311.175495: writeback_single_inode: bdi 0:15: ino=1536569 state=I_DIRTY_DATASYNC|I_REFERENCED age=0 wrote=0 to_write=1024 index=0
           <...>-2233  [006]   311.175498: writeback_single_inode: bdi 0:15: ino=1534002 state=I_DIRTY_DATASYNC|I_REFERENCED age=0 wrote=0 to_write=1024 index=0
           <...>-2233  [006]   311.175515: writeback_single_inode: bdi 0:15: ino=1574042 state=I_DIRTY_DATASYNC age=25000 wrote=1 to_write=1023 index=0
           <...>-2233  [006]   311.175522: writeback_single_inode: bdi 0:15: ino=1574028 state=I_DIRTY_DATASYNC age=25000 wrote=1 to_write=1022 index=137685
           <...>-2233  [006]   311.175524: writeback_single_inode: bdi 0:15: ino=1574024 state=I_DIRTY_DATASYNC age=25000 wrote=0 to_write=1022 index=0

> I've thought about it and also about Dave's analysis. Now I think it's OK to
> not add new inodes to b_io when it's not empty. But what I still don't like
> is that the emptiness / non-emptiness of b_io carries hidden internal
> state - callers of writeback_inodes_wb() shouldn't have to know or care
> about such subtleties (__writeback_inodes_sb() is an internal function so I
> don't care about that one too much).

That's why we liked the v1 implementation :)

> So I'd prefer writeback_inodes_wb() (and also __writeback_inodes_sb() but
> that's not too important) to do something like:
> 	int requeued = 0;
> requeue:
> 	if (list_empty(&wb->b_io)) {
> 		queue_io(wb, wbc->older_than_this);
> 		requeued = 1;
> 	}
> 	while (!list_empty(&wb->b_io)) {
> 		... do stuff ...
> 	}
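> 	/* write quota remains and b_io was not refreshed this pass:
> 	 * refill it from b_dirty and retry once */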
> 	if (wbc->nr_to_write > 0 && !requeued)
> 		goto requeue;

But that change must be coupled with the older_than_this switch,
and doing it here both loses the wbc visibility and scatters
the policy around..

> Because if you don't do this, you have to do similar change to all the
> callers of writeback_inodes_wb() (Ok, there are just three but still).

I found only one more caller: bdi_flush_io(), and it sets
older_than_this to NULL. In fact wb_writeback() is the only user of
older_than_this, originally for the kupdate work and now also for the
background work.

Basically we only need the retry when the policy switches, so it
makes sense to do it completely in either wb_writeback() or
move_expired_inodes()?

Thanks,
Fengguang

[-- Attachment #2: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/x-diff, Size: 3552 bytes --]

Subject: writeback: trace writeback_single_inode
Date: Wed Dec 01 17:33:37 CST 2010

It is valuable to know how the dirty inodes are iterated and their IO size.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   12 +++---
 include/trace/events/writeback.h |   56 +++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-13 17:18:19.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-13 17:18:20.000000000 +0800
@@ -347,7 +347,7 @@ writeback_single_inode(struct inode *ino
 {
 	struct address_space *mapping = inode->i_mapping;
 	long per_file_limit = wbc->per_file_limit;
-	long uninitialized_var(nr_to_write);
+	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -370,7 +370,8 @@ writeback_single_inode(struct inode *ino
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
 			requeue_io(inode);
-			return 0;
+			ret = 0;
+			goto out;
 		}
 
 		/*
@@ -387,10 +388,8 @@ writeback_single_inode(struct inode *ino
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_wb_list_lock);
 
-	if (per_file_limit) {
-		nr_to_write = wbc->nr_to_write;
+	if (per_file_limit)
 		wbc->nr_to_write = per_file_limit;
-	}
 
 	ret = do_writepages(mapping, wbc);
 
@@ -467,6 +466,9 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+out:
+	trace_writeback_single_inode(inode, wbc,
+				     nr_to_write - wbc->nr_to_write);
 	return ret;
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-04-13 17:18:18.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-04-13 17:18:20.000000000 +0800
@@ -10,6 +10,19 @@
 
 struct wb_writeback_work;
 
+#define show_inode_state(state)					\
+	__print_flags(state, "|",				\
+		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
+		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
+		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"},	\
+		{I_NEW,			"I_NEW"},		\
+		{I_WILL_FREE,		"I_WILL_FREE"},		\
+		{I_FREEING,		"I_FREEING"},		\
+		{I_CLEAR,		"I_CLEAR"},		\
+		{I_SYNC,		"I_SYNC"},		\
+		{I_REFERENCED,		"I_REFERENCED"}		\
+		)
+
 DECLARE_EVENT_CLASS(writeback_work_class,
 	TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
 	TP_ARGS(bdi, work),
@@ -149,6 +162,49 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_single_inode,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long wrote
+	),
+
+	TP_ARGS(inode, wbc, wrote),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, age)
+		__field(unsigned long, wrote)
+		__field(long, nr_to_write)
+		__field(unsigned long, writeback_index)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->age		= (jiffies - inode->dirtied_when) *
+								1000 / HZ;
+		__entry->wrote		= wrote;
+		__entry->nr_to_write	= wbc->nr_to_write;
+		__entry->writeback_index = inode->i_mapping->writeback_index;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+		  "wrote=%lu to_write=%ld index=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->age,
+		  __entry->wrote,
+		  __entry->nr_to_write,
+		  __entry->writeback_index
+	)
+);
+
 #define KBps(x)			((x) << (PAGE_SHIFT - 10))
 
 TRACE_EVENT(dirty_ratelimit,

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21  3:01                   ` Dave Chinner
@ 2011-04-21  3:59                     ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  3:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Thu, Apr 21, 2011 at 11:01:52AM +0800, Dave Chinner wrote:
> On Thu, Apr 21, 2011 at 10:06:17AM +0800, Wu Fengguang wrote:
> > On Thu, Apr 21, 2011 at 08:45:47AM +0800, Dave Chinner wrote:
> > > On Wed, Apr 20, 2011 at 10:53:21AM +0800, Wu Fengguang wrote:
> > > > On Wed, Apr 20, 2011 at 09:21:20AM +0800, Dave Chinner wrote:
> > > > > On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote:
> > > > > > I actually started with wb_writeback() as a natural choice, and then
> > > > > > found it much easier to do the expired-only=>all-inodes switching in
> > > > > > move_expired_inodes() since it needs to know the @b_dirty and @tmp
> > > > > > lists' emptiness to trigger the switch. It's not sane for
> > > > > > wb_writeback() to look into such details. And once you do the switch
> > > > > > part in move_expired_inodes(), the whole policy naturally follows.
> > > > > 
> > > > > Well, not really. You didn't need to modify move_expired_inodes() at
> > > > > all to implement these changes - all you needed to do was modify how
> > > > > older_than_this is configured.
> > > > > 
> > > > > writeback policy is defined by the struct writeback_control.
> > > > > move_expired_inodes() is pure mechanism. What you've done is remove
> > > > > policy from the struct wbc and moved it to move_expired_inodes(),
> > > > > which now defines both policy and mechanism.
> > > > 
> > > > > Further, this means that all the tracing that uses the struct wbc
> > > > > no longer shows the entire writeback policy that is being worked on,
> > > > > so we lose visibility into policy decisions that writeback is
> > > > > making.
> > > > 
> > > > Good point! I'm convinced, visibility is a necessity for debugging the
> > > > complex writeback behaviors.
> > > > 
> > > > > This same change is as simple as updating wbc->older_than_this
> > > > > appropriately after the wb_writeback() call for both background and
> > > > > kupdate and leaving the lower layers untouched. It's just a policy
> > > > > change. If you think the mechanism is inefficient, copy
> > > > > wbc->older_than_this to a local variable inside
> > > > > move_expired_inodes()....
> > > > 
> > > > Do you like something like this? (details will change a bit when
> > > > rearranging the patchset)
> > > 
> > > Yeah, this is close to what I had in mind.
> > > 
> > > > 
> > > > --- linux-next.orig/fs/fs-writeback.c	2011-04-20 10:30:47.000000000 +0800
> > > > +++ linux-next/fs/fs-writeback.c	2011-04-20 10:40:19.000000000 +0800
> > > > @@ -660,11 +660,6 @@ static long wb_writeback(struct bdi_writ
> > > >  	long write_chunk;
> > > >  	struct inode *inode;
> > > >  
> > > > -	if (wbc.for_kupdate) {
> > > > -		wbc.older_than_this = &oldest_jif;
> > > > -		oldest_jif = jiffies -
> > > > -				msecs_to_jiffies(dirty_expire_interval * 10);
> > > > -	}
> > > 
> > > Right here I'd do:
> > > 
> > > 	if (work->for_kupdate || work->for_background)
> > > 		wbc.older_than_this = &oldest_jif;
> > > 
> > > so that the setting of wbc.older_than_this in the loop can trigger
> > > on whether it is null or not.
> > 
> > That's the tricky part that drove me to change move_expired_inodes()
> > directly..
> > 
> > One important thing to bear in mind is that the background work can
> > run on for one hour, one day or whatever. During that time dirty
> > inodes come and go, expire and get cleaned.  If we only reset
> > wbc.older_than_this and never restore it _inside_ the loop, we'll
> > quickly lose the ability to "start with expired inodes", e.g. after
> > just 5 minutes.
> 
> However, there's not need to implicity switch back to expired inodes
> on the next wb_writeback loop - it only needs to switch back when
> b_io is emptied.

Right. However my intention is to make simple and safe code :)

> And I suspect that it really only needs to switch
> if there are inodes on b_more_io because if we didn't put any inodes
> onto b_more_io, then then we most likely cleaned the entire list of
> unexpired inodes in a single write chunk...
> 
> That is, something like this when updating the background state in
> the loop tail:
> 
> 	if (work->for_background && list_empty(&wb->b_io)) {
> 		if (wbc.older_than_this) {
> 			if (list_empty(&wb->b_more_io)) {
> 				wbc.older_than_this = NULL;
> 				continue;
> 			}
> 		} else if (!list_empty(&wb->b_more_io)) {
> 			wbc.older_than_this = &oldest_jif;
> 			continue;
> 		}
> 	}

Now how are you going to interpret the call trace? We'd have to walk
through all the above tests in our heads to reach the conclusion: ah,
got it, older_than_this is changed here because (... && ... && ...)...

Besides, we still need to update oldest_jif inside the loop (and you
can surely add more tests to the update rule..).

It took quite some time iterating possible situations through the
tests... ah, got a bug: what if it's all small files? older_than_this
will never be restored to &oldest_jif then...

> Still, given wb_writeback() is the only caller of both
> __writeback_inodes_sb and writeback_inodes_wb(), I'm wondering if
> moving the queue_io calls up into wb_writeback() would clean up this
> logic somewhat. I think Jan mentioned doing something like this as
> well elsewhere in the thread...

Unfortunately they call queue_io() inside the lock..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21  3:59                     ` Wu Fengguang
@ 2011-04-21  4:10                       ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  4:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

> > Still, given wb_writeback() is the only caller of both
> > __writeback_inodes_sb and writeback_inodes_wb(), I'm wondering if
> > moving the queue_io calls up into wb_writeback() would clean up this
> > logic somewhat. I think Jan mentioned doing something like this as
> > well elsewhere in the thread...
> 
> Unfortunately they call queue_io() inside the lock..

OK, let's try moving up the lock too. Do you like this change? :)

Thanks,
Fengguang
---
 fs/fs-writeback.c |   22 ++++++----------------
 mm/backing-dev.c  |    4 ++++
 2 files changed, 10 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-21 12:04:02.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-21 12:05:54.000000000 +0800
@@ -591,7 +591,6 @@ void writeback_inodes_wb(struct bdi_writ
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_wb_list_lock);
 
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc);
@@ -610,22 +609,9 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&inode_wb_list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
-static void __writeback_inodes_sb(struct super_block *sb,
-		struct bdi_writeback *wb, struct writeback_control *wbc)
-{
-	WARN_ON(!rwsem_is_locked(&sb->s_umount));
-
-	spin_lock(&inode_wb_list_lock);
-	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc);
-	writeback_sb_inodes(sb, wb, wbc, true);
-	spin_unlock(&inode_wb_list_lock);
-}
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
@@ -652,7 +638,7 @@ static unsigned long writeback_chunk_siz
 	 * The intended call sequence for WB_SYNC_ALL writeback is:
 	 *
 	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
+	 *          writeback_sb_inodes()       <== called only once
 	 *              write_cache_pages()     <== called once for each inode
 	 *                  (quickly) tag currently dirty pages
 	 *                  (maybe slowly) sync all tagged pages
@@ -742,10 +728,14 @@ static long wb_writeback(struct bdi_writ
 
 retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
+		spin_lock(&inode_wb_list_lock);
+		if (list_empty(&wb->b_io))
+			queue_io(wb, wbc);
 		if (work->sb)
-			__writeback_inodes_sb(work->sb, wb, &wbc);
+			writeback_sb_inodes(work->sb, wb, &wbc, true);
 		else
 			writeback_inodes_wb(wb, &wbc);
+		spin_unlock(&inode_wb_list_lock);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
 		bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
--- linux-next.orig/mm/backing-dev.c	2011-04-21 12:06:02.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-04-21 12:06:31.000000000 +0800
@@ -268,7 +268,11 @@ static void bdi_flush_io(struct backing_
 		.nr_to_write		= 1024,
 	};
 
+	spin_lock(&inode_wb_list_lock);
+	if (list_empty(&bdi->wb.b_io))
+		queue_io(&bdi->wb, &wbc);
 	writeback_inodes_wb(&bdi->wb, &wbc);
+	spin_unlock(&inode_wb_list_lock);
 }
 
 /*

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 0/6] writeback: moving expire targets for background/kupdate works
  2011-04-19  3:00 ` Wu Fengguang
@ 2011-04-21  4:34   ` Christoph Hellwig
  -1 siblings, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-04-21  4:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

Hi Wu,

if you're queueing up writeback changes, can you look into splitting
inode_wb_list_lock as was done in earlier versions of the inode
scalability patches?  Especially if we don't get the I/O-less
balance_dirty_pages in ASAP, it'll at least allow us to scale the
busy waiting for the list manipulations to one CPU per BDI.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21  4:10                       ` Wu Fengguang
@ 2011-04-21  4:36                         ` Christoph Hellwig
  -1 siblings, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-04-21  4:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 12:10:11PM +0800, Wu Fengguang wrote:
> OK, let's try moving up the lock too. Do you like this change? :)

I like it, especially as it kills the horribly named
__writeback_inodes_sb that I added a while ago.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  3:33             ` Wu Fengguang
@ 2011-04-21  4:39                 ` Christoph Hellwig
  2011-04-21  7:09                 ` Dave Chinner
  1 sibling, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-04-21  4:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> I collected the writeback_single_inode() traces (patch attached for
> your reference) for several test runs each, and found many more
> I_DIRTY_PAGES after the patchset. Dave, do you know why so many
> I_DIRTY_PAGES (or radix tags) remain after the XFS ->writepages() call,
> even for small files?

What is your definition of a small file?  As soon as it has multiple
extents or holes there's absolutely no way to clean it with a single
writepage call.  Also XFS tries to operate as non-blocking as possible
if the non-blocking flag is set in the wbc, but that flag actually
seems to be dead these days.
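
To illustrate that last point, the back-off that used to key off the
flag looked roughly like this in older kernels' write_cache_pages()
(a paraphrased sketch, not current code):

	if (wbc->nonblocking && bdi_write_congested(bdi)) {
		/* back off instead of blocking the flusher thread */
		wbc->encountered_congestion = 1;
		done = 1;
		break;
	}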


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 6/6] NFS: return -EAGAIN when skipped commit in nfs_commit_unstable_pages()
  2011-04-19  3:00   ` Wu Fengguang
@ 2011-04-21  4:40     ` Christoph Hellwig
  -1 siblings, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-04-21  4:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Trond Myklebust,
	Dave Chinner, Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue, Apr 19, 2011 at 11:00:09AM +0800, Wu Fengguang wrote:
> It's probably not sane to return success while redirtying the inode at
> the same time in ->write_inode().

It is not, as it really confuses the writeback code. 
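
The saner contract, in sketch form (hypothetical code, not the actual
NFS change):

	/*
	 * Hypothetical ->write_inode(): if the commit has to be skipped
	 * while unstable pages remain, report -EAGAIN so the caller
	 * requeues the inode knowingly, instead of redirtying it behind
	 * writeback's back while claiming success.
	 */
	static int foo_write_inode(struct inode *inode,
				   struct writeback_control *wbc)
	{
		if (foo_commit_skipped(inode))	/* hypothetical predicate */
			return -EAGAIN;
		return foo_commit(inode);	/* hypothetical helper */
	}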

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 0/6] writeback: moving expire targets for background/kupdate works
  2011-04-21  4:34   ` Christoph Hellwig
@ 2011-04-21  5:50     ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  5:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

Hi Christoph,

On Thu, Apr 21, 2011 at 12:34:50PM +0800, Christoph Hellwig wrote:
> Hi Wu,
> 
> if you're queueing up writeback changes, can you look into splitting
> inode_wb_list_lock as was done in earlier versions of the inode
> scalability patches?  Especially if we don't get the I/O-less
> balance_dirty_pages in ASAP, it'll at least allow us to scale the
> busy waiting for the list manipulations to one CPU per BDI.

Do you mean to split inode_wb_list_lock into struct bdi_writeback,
so as to improve at least the JBOD case now and hopefully benefit the
1-bdi case when switching to multiple bdi_writeback per bdi in the future?

I've not touched any locking code before, but it looks like some dumb
code replacement. Let me try it :)
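
In sketch form, I imagine something like this (the list_lock name is my
assumption; the real patch may look different):

	/* move the global inode_wb_list_lock into each bdi_writeback */
	struct bdi_writeback {
		struct backing_dev_info *bdi;
		spinlock_t list_lock;	/* protects b_dirty/b_io/b_more_io */
		struct list_head b_dirty;
		struct list_head b_io;
		struct list_head b_more_io;
		/* ... */
	};

	/* callers then take the per-wb lock instead of the global one */
	spin_lock(&wb->list_lock);
	if (list_empty(&wb->b_io))
		queue_io(wb, wbc);
	writeback_inodes_wb(wb, &wbc);
	spin_unlock(&wb->list_lock);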

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 0/6] writeback: moving expire targets for background/kupdate works
  2011-04-21  5:50     ` Wu Fengguang
@ 2011-04-21  5:56       ` Christoph Hellwig
  -1 siblings, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-04-21  5:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Andrew Morton, Jan Kara, Mel Gorman,
	Dave Chinner, Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 01:50:31PM +0800, Wu Fengguang wrote:
> Hi Christoph,
> 
> On Thu, Apr 21, 2011 at 12:34:50PM +0800, Christoph Hellwig wrote:
> > Hi Wu,
> > 
> > if you're queueing up writeback changes, can you look into splitting
> > inode_wb_list_lock as was done in earlier versions of the inode
> > scalability patches?  Especially if we don't get the I/O-less
> > balance_dirty_pages in ASAP, it'll at least allow us to scale the
> > busy waiting for the list manipulations to one CPU per BDI.
> 
> Do you mean to split inode_wb_list_lock into struct bdi_writeback,
> so as to improve at least the JBOD case now and hopefully benefit the
> 1-bdi case when switching to multiple bdi_writeback per bdi in the future?
> 
> I've not touched any locking code before, but it looks like some dumb
> code replacement. Let me try it :)

I can do the patch if you want; it would be useful to carry it in your
series to avoid conflicts, though.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  4:39                 ` Christoph Hellwig
@ 2011-04-21  6:05                   ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  6:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > I collected the writeback_single_inode() traces (patch attached for
> > your reference) for several test runs each, and found many more
> > I_DIRTY_PAGES after the patchset. Dave, do you know why so many
> > I_DIRTY_PAGES (or radix tags) remain after the XFS ->writepages() call,
> > even for small files?
> 
> What is your definition of a small file?  As soon as it has multiple
> extents or holes there's absolutely no way to clean it with a single
> writepage call.

The workload is writing a kernel source tree to XFS. You can see in the
trace below that writeback often leaves dirty pages behind (indicated by
the I_DIRTY_PAGES flag) after writing as few as 1 page (indicated by the
wrote=1 field).
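
For reference, I_DIRTY_PAGES is (re)set by the tail of
writeback_single_inode() when ->writepages() leaves tagged dirty pages
behind, roughly (paraphrased sketch, not the exact source):

	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
		inode->i_state |= I_DIRTY_PAGES;
		if (wbc->nr_to_write <= 0)
			requeue_io(inode);	/* slice used up, more to do */
		else
			redirty_tail(inode);	/* writeback got blocked */
	}

So wrote=1 together with I_DIRTY_PAGES means a single page went out and
dirty pages were still left on the inode.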

> Also XFS tries to operate as non-blocking as possible
> if the non-blocking flag is set in the wbc, but that flag actually
> seems to be dead these days.

Yeah.

Thanks,
Fengguang
---
wfg /tmp% head -300 trace-dt7-moving-expire-xfs
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
            init-1     [004]  5291.655631: writeback_single_inode: bdi 0:15: ino=1574069 state= age=6 wrote=2 to_write=9223372036854775805 index=179837
            init-1     [004]  5291.657137: writeback_single_inode: bdi 0:15: ino=1574069 state= age=7 wrote=0 to_write=9223372036854775807 index=0
            init-1     [004]  5291.657141: writeback_single_inode: bdi 0:15: ino=1574069 state= age=7 wrote=0 to_write=9223372036854775807 index=0
            init-1     [004]  5291.659716: writeback_single_inode: bdi 0:15: ino=1574069 state= age=3 wrote=1 to_write=9223372036854775806 index=179837
##### CPU 6 buffer started ####
           getty-3417  [006]  5291.661265: writeback_single_inode: bdi 0:15: ino=1574069 state= age=4 wrote=0 to_write=9223372036854775807 index=0
           getty-3417  [006]  5291.661269: writeback_single_inode: bdi 0:15: ino=1574069 state= age=4 wrote=0 to_write=9223372036854775807 index=0
           getty-3417  [006]  5291.663963: writeback_single_inode: bdi 0:15: ino=1574069 state= age=3 wrote=1 to_write=9223372036854775806 index=179837
       flush-8:0-3402  [006]  5291.903857: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_DATASYNC|I_DIRTY_PAGES age=323 wrote=4097 to_write=-1 index=0
       flush-8:0-3402  [006]  5291.919833: writeback_single_inode: bdi 8:0: ino=133 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4095 index=0
       flush-8:0-3402  [006]  5291.919876: writeback_single_inode: bdi 8:0: ino=134 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4093 index=1
       flush-8:0-3402  [006]  5291.919913: writeback_single_inode: bdi 8:0: ino=135 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=4088 index=4
       flush-8:0-3402  [006]  5291.919969: writeback_single_inode: bdi 8:0: ino=136 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=23 to_write=4065 index=13
       flush-8:0-3402  [006]  5291.920008: writeback_single_inode: bdi 8:0: ino=134217857 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4064 index=0
       flush-8:0-3402  [006]  5291.920049: writeback_single_inode: bdi 8:0: ino=134217858 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=4060 index=3
       flush-8:0-3402  [006]  5291.920087: writeback_single_inode: bdi 8:0: ino=268628417 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4059 index=0
       flush-8:0-3402  [006]  5291.920128: writeback_single_inode: bdi 8:0: ino=402653313 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4058 index=0
       flush-8:0-3402  [006]  5291.920160: writeback_single_inode: bdi 8:0: ino=402653314 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4057 index=0
       flush-8:0-3402  [006]  5291.920194: writeback_single_inode: bdi 8:0: ino=402653315 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4056 index=0
       flush-8:0-3402  [006]  5291.920225: writeback_single_inode: bdi 8:0: ino=402653316 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4055 index=0
       flush-8:0-3402  [006]  5291.920260: writeback_single_inode: bdi 8:0: ino=138 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4054 index=0
       flush-8:0-3402  [006]  5291.920291: writeback_single_inode: bdi 8:0: ino=139 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4053 index=0
       flush-8:0-3402  [006]  5291.920325: writeback_single_inode: bdi 8:0: ino=140 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4052 index=0
       flush-8:0-3402  [006]  5291.920356: writeback_single_inode: bdi 8:0: ino=141 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4051 index=0
       flush-8:0-3402  [006]  5291.920393: writeback_single_inode: bdi 8:0: ino=134217860 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4050 index=0
       flush-8:0-3402  [006]  5291.920425: writeback_single_inode: bdi 8:0: ino=134217861 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4049 index=0
       flush-8:0-3402  [006]  5291.920458: writeback_single_inode: bdi 8:0: ino=134217862 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4048 index=0
       flush-8:0-3402  [006]  5291.920489: writeback_single_inode: bdi 8:0: ino=134217863 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4047 index=0
       flush-8:0-3402  [006]  5291.920524: writeback_single_inode: bdi 8:0: ino=134217864 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4045 index=1
       flush-8:0-3402  [006]  5291.920556: writeback_single_inode: bdi 8:0: ino=134217865 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4044 index=0
       flush-8:0-3402  [006]  5291.920589: writeback_single_inode: bdi 8:0: ino=134217866 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4043 index=0
       flush-8:0-3402  [006]  5291.920620: writeback_single_inode: bdi 8:0: ino=134217867 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4042 index=0
       flush-8:0-3402  [006]  5291.920653: writeback_single_inode: bdi 8:0: ino=134217868 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4041 index=0
       flush-8:0-3402  [006]  5291.920718: writeback_single_inode: bdi 8:0: ino=134217869 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4040 index=0
       flush-8:0-3402  [006]  5291.920758: writeback_single_inode: bdi 8:0: ino=268628419 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4039 index=0
       flush-8:0-3402  [006]  5291.920790: writeback_single_inode: bdi 8:0: ino=268628420 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4038 index=0
       flush-8:0-3402  [006]  5291.920823: writeback_single_inode: bdi 8:0: ino=268628421 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4037 index=0
       flush-8:0-3402  [006]  5291.920855: writeback_single_inode: bdi 8:0: ino=268628422 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4036 index=0
       flush-8:0-3402  [006]  5291.920890: writeback_single_inode: bdi 8:0: ino=268628423 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4035 index=0
       flush-8:0-3402  [006]  5291.920924: writeback_single_inode: bdi 8:0: ino=268628424 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4033 index=1
       flush-8:0-3402  [006]  5291.920957: writeback_single_inode: bdi 8:0: ino=268628425 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4032 index=0
       flush-8:0-3402  [006]  5291.920988: writeback_single_inode: bdi 8:0: ino=268628426 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4031 index=0
       flush-8:0-3402  [006]  5291.921021: writeback_single_inode: bdi 8:0: ino=268628427 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4030 index=0
       flush-8:0-3402  [006]  5291.921054: writeback_single_inode: bdi 8:0: ino=268628428 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4028 index=1
       flush-8:0-3402  [006]  5291.921091: writeback_single_inode: bdi 8:0: ino=268628429 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4027 index=0
       flush-8:0-3402  [006]  5291.921122: writeback_single_inode: bdi 8:0: ino=268628430 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4026 index=0
       flush-8:0-3402  [006]  5291.921155: writeback_single_inode: bdi 8:0: ino=268628431 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4025 index=0
       flush-8:0-3402  [006]  5291.921188: writeback_single_inode: bdi 8:0: ino=268628432 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4023 index=1
       flush-8:0-3402  [006]  5291.921224: writeback_single_inode: bdi 8:0: ino=268628433 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4022 index=0
       flush-8:0-3402  [006]  5291.921256: writeback_single_inode: bdi 8:0: ino=268628434 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4021 index=0
       flush-8:0-3402  [006]  5291.921289: writeback_single_inode: bdi 8:0: ino=268628435 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4020 index=0
       flush-8:0-3402  [006]  5291.921320: writeback_single_inode: bdi 8:0: ino=268628436 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4019 index=0
       flush-8:0-3402  [006]  5291.921354: writeback_single_inode: bdi 8:0: ino=268628437 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4018 index=0
       flush-8:0-3402  [006]  5291.921385: writeback_single_inode: bdi 8:0: ino=268628438 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4017 index=0
       flush-8:0-3402  [006]  5291.921421: writeback_single_inode: bdi 8:0: ino=268628439 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4016 index=0
       flush-8:0-3402  [006]  5291.921453: writeback_single_inode: bdi 8:0: ino=268628440 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4015 index=0
       flush-8:0-3402  [006]  5291.921487: writeback_single_inode: bdi 8:0: ino=268628441 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4014 index=0
       flush-8:0-3402  [006]  5291.921518: writeback_single_inode: bdi 8:0: ino=268628442 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4013 index=0
       flush-8:0-3402  [006]  5291.921552: writeback_single_inode: bdi 8:0: ino=268628443 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4012 index=0
       flush-8:0-3402  [006]  5291.921586: writeback_single_inode: bdi 8:0: ino=268628444 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=4009 index=2
       flush-8:0-3402  [006]  5291.921622: writeback_single_inode: bdi 8:0: ino=268628445 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4007 index=1
       flush-8:0-3402  [006]  5291.921653: writeback_single_inode: bdi 8:0: ino=268628446 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4006 index=0
       flush-8:0-3402  [006]  5291.921709: writeback_single_inode: bdi 8:0: ino=268628447 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4005 index=0
       flush-8:0-3402  [006]  5291.921742: writeback_single_inode: bdi 8:0: ino=268628448 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4004 index=0
       flush-8:0-3402  [006]  5291.921775: writeback_single_inode: bdi 8:0: ino=268628449 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4003 index=0
       flush-8:0-3402  [006]  5291.921807: writeback_single_inode: bdi 8:0: ino=268628450 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4002 index=0
       flush-8:0-3402  [006]  5291.921840: writeback_single_inode: bdi 8:0: ino=268628451 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4001 index=0
       flush-8:0-3402  [006]  5291.921874: writeback_single_inode: bdi 8:0: ino=268628452 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3999 index=1
       flush-8:0-3402  [006]  5291.921909: writeback_single_inode: bdi 8:0: ino=268628453 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3997 index=1
       flush-8:0-3402  [006]  5291.921940: writeback_single_inode: bdi 8:0: ino=268628454 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3996 index=0
       flush-8:0-3402  [006]  5291.921974: writeback_single_inode: bdi 8:0: ino=268628455 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3995 index=0
       flush-8:0-3402  [006]  5291.922005: writeback_single_inode: bdi 8:0: ino=268628456 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3994 index=0
       flush-8:0-3402  [006]  5291.922044: writeback_single_inode: bdi 8:0: ino=268628457 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3992 index=1
       flush-8:0-3402  [006]  5291.922077: writeback_single_inode: bdi 8:0: ino=268628458 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3990 index=1
       flush-8:0-3402  [006]  5291.922116: writeback_single_inode: bdi 8:0: ino=268628459 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3988 index=1
       flush-8:0-3402  [006]  5291.922149: writeback_single_inode: bdi 8:0: ino=268628460 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3986 index=1
       flush-8:0-3402  [006]  5291.922182: writeback_single_inode: bdi 8:0: ino=268628461 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3985 index=0
       flush-8:0-3402  [006]  5291.922213: writeback_single_inode: bdi 8:0: ino=268628462 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3984 index=0
       flush-8:0-3402  [006]  5291.922246: writeback_single_inode: bdi 8:0: ino=268628463 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3983 index=0
       flush-8:0-3402  [006]  5291.922277: writeback_single_inode: bdi 8:0: ino=268628464 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3982 index=0
       flush-8:0-3402  [006]  5291.922310: writeback_single_inode: bdi 8:0: ino=268628465 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3981 index=0
       flush-8:0-3402  [006]  5291.922341: writeback_single_inode: bdi 8:0: ino=268628466 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3980 index=0
       flush-8:0-3402  [006]  5291.922375: writeback_single_inode: bdi 8:0: ino=268628467 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3979 index=0
       flush-8:0-3402  [006]  5291.922406: writeback_single_inode: bdi 8:0: ino=268628468 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3978 index=0
       flush-8:0-3402  [006]  5291.922439: writeback_single_inode: bdi 8:0: ino=268628469 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3977 index=0
       flush-8:0-3402  [006]  5291.922474: writeback_single_inode: bdi 8:0: ino=268628470 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3972 index=4
       flush-8:0-3402  [006]  5291.922508: writeback_single_inode: bdi 8:0: ino=268628471 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3971 index=0
       flush-8:0-3402  [006]  5291.922539: writeback_single_inode: bdi 8:0: ino=268628472 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3970 index=0
       flush-8:0-3402  [006]  5291.922572: writeback_single_inode: bdi 8:0: ino=268628473 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3969 index=0
       flush-8:0-3402  [006]  5291.922603: writeback_single_inode: bdi 8:0: ino=268628474 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3968 index=0
       flush-8:0-3402  [006]  5291.922636: writeback_single_inode: bdi 8:0: ino=268628475 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3967 index=0
       flush-8:0-3402  [006]  5291.922673: writeback_single_inode: bdi 8:0: ino=268628476 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3966 index=0
       flush-8:0-3402  [006]  5291.922709: writeback_single_inode: bdi 8:0: ino=268628477 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3965 index=0
       flush-8:0-3402  [006]  5291.922741: writeback_single_inode: bdi 8:0: ino=268628478 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3964 index=0
       flush-8:0-3402  [006]  5291.922777: writeback_single_inode: bdi 8:0: ino=268628479 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3963 index=0
       flush-8:0-3402  [006]  5291.922810: writeback_single_inode: bdi 8:0: ino=268628480 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3961 index=1
       flush-8:0-3402  [006]  5291.922850: writeback_single_inode: bdi 8:0: ino=268628481 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3960 index=0
       flush-8:0-3402  [006]  5291.922882: writeback_single_inode: bdi 8:0: ino=268628482 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3959 index=0
       flush-8:0-3402  [006]  5291.922915: writeback_single_inode: bdi 8:0: ino=268628483 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3958 index=0
       flush-8:0-3402  [006]  5291.922946: writeback_single_inode: bdi 8:0: ino=268628484 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3957 index=0
       flush-8:0-3402  [006]  5291.922980: writeback_single_inode: bdi 8:0: ino=268628485 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3956 index=0
       flush-8:0-3402  [006]  5291.923015: writeback_single_inode: bdi 8:0: ino=134217870 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3953 index=2
       flush-8:0-3402  [006]  5291.923052: writeback_single_inode: bdi 8:0: ino=134217871 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3949 index=3
       flush-8:0-3402  [006]  5291.923090: writeback_single_inode: bdi 8:0: ino=134217872 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3941 index=7
       flush-8:0-3402  [006]  5291.923129: writeback_single_inode: bdi 8:0: ino=134217873 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3934 index=6
       flush-8:0-3402  [006]  5291.923167: writeback_single_inode: bdi 8:0: ino=134217874 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3927 index=6
       flush-8:0-3402  [006]  5291.923202: writeback_single_inode: bdi 8:0: ino=134217875 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3925 index=1
       flush-8:0-3402  [006]  5291.923234: writeback_single_inode: bdi 8:0: ino=134217876 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3924 index=0
       flush-8:0-3402  [006]  5291.923268: writeback_single_inode: bdi 8:0: ino=402653318 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3923 index=0
       flush-8:0-3402  [006]  5291.923305: writeback_single_inode: bdi 8:0: ino=402653319 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=6 to_write=3917 index=5
       flush-8:0-3402  [006]  5291.923341: writeback_single_inode: bdi 8:0: ino=402653320 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3915 index=1
       flush-8:0-3402  [006]  5291.923372: writeback_single_inode: bdi 8:0: ino=402653321 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3914 index=0
       flush-8:0-3402  [006]  5291.923410: writeback_single_inode: bdi 8:0: ino=402653322 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3910 index=3
       flush-8:0-3402  [006]  5291.923444: writeback_single_inode: bdi 8:0: ino=402653323 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3906 index=3
       flush-8:0-3402  [006]  5291.923483: writeback_single_inode: bdi 8:0: ino=402653324 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3903 index=2
       flush-8:0-3402  [006]  5291.923521: writeback_single_inode: bdi 8:0: ino=402653325 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3895 index=7
       flush-8:0-3402  [006]  5291.923556: writeback_single_inode: bdi 8:0: ino=143 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3894 index=0
       flush-8:0-3402  [006]  5291.923595: writeback_single_inode: bdi 8:0: ino=144 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=10 to_write=3884 index=9
       flush-8:0-3402  [006]  5291.923630: writeback_single_inode: bdi 8:0: ino=145 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3882 index=1
       flush-8:0-3402  [006]  5291.923673: writeback_single_inode: bdi 8:0: ino=146 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3875 index=6
       flush-8:0-3402  [006]  5291.923711: writeback_single_inode: bdi 8:0: ino=147 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3874 index=0
       flush-8:0-3402  [006]  5291.923746: writeback_single_inode: bdi 8:0: ino=148 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3870 index=3
       flush-8:0-3402  [006]  5291.923780: writeback_single_inode: bdi 8:0: ino=149 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3869 index=0
       flush-8:0-3402  [006]  5291.923817: writeback_single_inode: bdi 8:0: ino=150 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=6 to_write=3863 index=5
       flush-8:0-3402  [006]  5291.923852: writeback_single_inode: bdi 8:0: ino=151 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3860 index=2
       flush-8:0-3402  [006]  5291.923887: writeback_single_inode: bdi 8:0: ino=152 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3856 index=3
       flush-8:0-3402  [006]  5291.923931: writeback_single_inode: bdi 8:0: ino=153 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=13 to_write=3843 index=12
       flush-8:0-3402  [006]  5291.923964: writeback_single_inode: bdi 8:0: ino=154 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3841 index=1
       flush-8:0-3402  [006]  5291.924014: writeback_single_inode: bdi 8:0: ino=155 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=18 to_write=3823 index=13
       flush-8:0-3402  [006]  5291.924045: writeback_single_inode: bdi 8:0: ino=156 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3822 index=0
       flush-8:0-3402  [006]  5291.924092: writeback_single_inode: bdi 8:0: ino=157 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=13 to_write=3809 index=12
       flush-8:0-3402  [006]  5291.924127: writeback_single_inode: bdi 8:0: ino=402653326 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3805 index=3
       flush-8:0-3402  [006]  5291.924167: writeback_single_inode: bdi 8:0: ino=402653327 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3797 index=7
       flush-8:0-3402  [006]  5291.924203: writeback_single_inode: bdi 8:0: ino=402653328 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3792 index=4
       flush-8:0-3402  [006]  5291.924242: writeback_single_inode: bdi 8:0: ino=402653329 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3789 index=2
       flush-8:0-3402  [006]  5291.924282: writeback_single_inode: bdi 8:0: ino=402653330 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3778 index=10
       flush-8:0-3402  [006]  5291.924330: writeback_single_inode: bdi 8:0: ino=402653331 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=17 to_write=3761 index=13
       flush-8:0-3402  [006]  5291.924370: writeback_single_inode: bdi 8:0: ino=402653332 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3750 index=10
       flush-8:0-3402  [006]  5291.924413: writeback_single_inode: bdi 8:0: ino=402653333 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3738 index=11
       flush-8:0-3402  [006]  5291.924446: writeback_single_inode: bdi 8:0: ino=402653334 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3735 index=2
       flush-8:0-3402  [006]  5291.924483: writeback_single_inode: bdi 8:0: ino=402653335 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3731 index=3
       flush-8:0-3402  [006]  5291.924513: writeback_single_inode: bdi 8:0: ino=402653336 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3730 index=0
       flush-8:0-3402  [006]  5291.924554: writeback_single_inode: bdi 8:0: ino=402653337 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3722 index=7
       flush-8:0-3402  [006]  5291.924588: writeback_single_inode: bdi 8:0: ino=402653338 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3719 index=2
       flush-8:0-3402  [006]  5291.924626: writeback_single_inode: bdi 8:0: ino=402653339 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3717 index=1
       flush-8:0-3402  [006]  5291.924679: writeback_single_inode: bdi 8:0: ino=402653340 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3705 index=11
       flush-8:0-3402  [006]  5291.924719: writeback_single_inode: bdi 8:0: ino=402653341 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3704 index=0
       flush-8:0-3402  [006]  5291.924751: writeback_single_inode: bdi 8:0: ino=402653342 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3702 index=1
       flush-8:0-3402  [006]  5291.924787: writeback_single_inode: bdi 8:0: ino=402653343 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3699 index=2
       flush-8:0-3402  [006]  5291.924820: writeback_single_inode: bdi 8:0: ino=402653344 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3697 index=1
       flush-8:0-3402  [006]  5291.924856: writeback_single_inode: bdi 8:0: ino=402653345 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3693 index=3
       flush-8:0-3402  [006]  5291.924888: writeback_single_inode: bdi 8:0: ino=402653346 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3692 index=0
       flush-8:0-3402  [006]  5291.924921: writeback_single_inode: bdi 8:0: ino=402653347 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3691 index=0
       flush-8:0-3402  [006]  5291.924952: writeback_single_inode: bdi 8:0: ino=402653348 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3690 index=0
       flush-8:0-3402  [006]  5291.924995: writeback_single_inode: bdi 8:0: ino=402653349 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=9 to_write=3681 index=8
       flush-8:0-3402  [006]  5291.925035: writeback_single_inode: bdi 8:0: ino=402653350 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=10 to_write=3671 index=9
       flush-8:0-3402  [006]  5291.925070: writeback_single_inode: bdi 8:0: ino=134217878 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3670 index=0
       flush-8:0-3402  [006]  5291.925103: writeback_single_inode: bdi 8:0: ino=134217879 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3668 index=1
       flush-8:0-3402  [006]  5291.925140: writeback_single_inode: bdi 8:0: ino=134217880 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3663 index=4
       flush-8:0-3402  [006]  5291.925181: writeback_single_inode: bdi 8:0: ino=134217881 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3651 index=11
       flush-8:0-3402  [006]  5291.925235: writeback_single_inode: bdi 8:0: ino=134217882 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=24 to_write=3627 index=13
       flush-8:0-3402  [006]  5291.925283: writeback_single_inode: bdi 8:0: ino=134217883 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=20 to_write=3607 index=13
       flush-8:0-3402  [006]  5291.925319: writeback_single_inode: bdi 8:0: ino=134217884 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3605 index=1
       flush-8:0-3402  [006]  5291.925351: writeback_single_inode: bdi 8:0: ino=134217885 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3603 index=1
       flush-8:0-3402  [006]  5291.925386: writeback_single_inode: bdi 8:0: ino=134217886 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3601 index=1
       flush-8:0-3402  [006]  5291.925417: writeback_single_inode: bdi 8:0: ino=134217887 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3600 index=0
       flush-8:0-3402  [006]  5291.925450: writeback_single_inode: bdi 8:0: ino=134217888 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3599 index=0
       flush-8:0-3402  [006]  5291.925481: writeback_single_inode: bdi 8:0: ino=134217889 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3598 index=0
       flush-8:0-3402  [006]  5291.925519: writeback_single_inode: bdi 8:0: ino=134217890 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3596 index=1
       flush-8:0-3402  [006]  5291.925552: writeback_single_inode: bdi 8:0: ino=134217891 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3594 index=1
       flush-8:0-3402  [006]  5291.925594: writeback_single_inode: bdi 8:0: ino=134217892 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3589 index=4
       flush-8:0-3402  [006]  5291.925626: writeback_single_inode: bdi 8:0: ino=134217893 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3588 index=0
       flush-8:0-3402  [006]  5291.925669: writeback_single_inode: bdi 8:0: ino=134217894 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3584 index=3
       flush-8:0-3402  [006]  5291.925703: writeback_single_inode: bdi 8:0: ino=134217895 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3582 index=1
       flush-8:0-3402  [006]  5291.925746: writeback_single_inode: bdi 8:0: ino=134217896 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3574 index=7
       flush-8:0-3402  [006]  5291.925777: writeback_single_inode: bdi 8:0: ino=134217897 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3573 index=0
       flush-8:0-3402  [006]  5291.925813: writeback_single_inode: bdi 8:0: ino=134217898 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3571 index=1
       flush-8:0-3402  [006]  5291.925850: writeback_single_inode: bdi 8:0: ino=134217899 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3564 index=6
       flush-8:0-3402  [006]  5291.925891: writeback_single_inode: bdi 8:0: ino=134217900 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3557 index=6
       flush-8:0-3402  [006]  5291.925925: writeback_single_inode: bdi 8:0: ino=134217901 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3554 index=2
       flush-8:0-3402  [006]  5291.925965: writeback_single_inode: bdi 8:0: ino=134217902 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3547 index=6
       flush-8:0-3402  [006]  5291.925999: writeback_single_inode: bdi 8:0: ino=134217903 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3544 index=2
       flush-8:0-3402  [006]  5291.926033: writeback_single_inode: bdi 8:0: ino=134217904 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3543 index=0
       flush-8:0-3402  [006]  5291.926065: writeback_single_inode: bdi 8:0: ino=134217905 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3541 index=1
       flush-8:0-3402  [006]  5291.926100: writeback_single_inode: bdi 8:0: ino=134217906 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3539 index=1
       flush-8:0-3402  [006]  5291.926131: writeback_single_inode: bdi 8:0: ino=134217907 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3538 index=0
       flush-8:0-3402  [006]  5291.926164: writeback_single_inode: bdi 8:0: ino=134217908 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3537 index=0
       flush-8:0-3402  [006]  5291.926197: writeback_single_inode: bdi 8:0: ino=134217909 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3535 index=1
       flush-8:0-3402  [006]  5291.926232: writeback_single_inode: bdi 8:0: ino=134217910 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3533 index=1
       flush-8:0-3402  [006]  5291.926264: writeback_single_inode: bdi 8:0: ino=134217911 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3531 index=1
       flush-8:0-3402  [006]  5291.926298: writeback_single_inode: bdi 8:0: ino=134217912 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3530 index=0
       flush-8:0-3402  [006]  5291.926338: writeback_single_inode: bdi 8:0: ino=134217913 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3519 index=10
       flush-8:0-3402  [006]  5291.926376: writeback_single_inode: bdi 8:0: ino=134217914 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3517 index=1
       flush-8:0-3402  [006]  5291.926411: writeback_single_inode: bdi 8:0: ino=134217915 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3514 index=2
       flush-8:0-3402  [006]  5291.926450: writeback_single_inode: bdi 8:0: ino=134217916 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3511 index=2
       flush-8:0-3402  [006]  5291.926482: writeback_single_inode: bdi 8:0: ino=134217917 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3510 index=0
       flush-8:0-3402  [006]  5291.926516: writeback_single_inode: bdi 8:0: ino=134217918 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3508 index=1
       flush-8:0-3402  [006]  5291.926549: writeback_single_inode: bdi 8:0: ino=134217919 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3506 index=1
       flush-8:0-3402  [006]  5291.926594: writeback_single_inode: bdi 8:0: ino=134217984 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=9 to_write=3497 index=8
       flush-8:0-3402  [006]  5291.926627: writeback_single_inode: bdi 8:0: ino=134217985 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3494 index=2
       flush-8:0-3402  [006]  5291.926667: writeback_single_inode: bdi 8:0: ino=134217986 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3493 index=0
       flush-8:0-3402  [006]  5291.926699: writeback_single_inode: bdi 8:0: ino=134217987 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3492 index=0
       flush-8:0-3402  [006]  5291.926732: writeback_single_inode: bdi 8:0: ino=134217988 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3491 index=0
       flush-8:0-3402  [006]  5291.926763: writeback_single_inode: bdi 8:0: ino=134217989 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3490 index=0
       flush-8:0-3402  [006]  5291.926796: writeback_single_inode: bdi 8:0: ino=134217990 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3489 index=0
       flush-8:0-3402  [006]  5291.926827: writeback_single_inode: bdi 8:0: ino=134217991 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3488 index=0
       flush-8:0-3402  [006]  5291.926862: writeback_single_inode: bdi 8:0: ino=134217992 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3486 index=1
       flush-8:0-3402  [006]  5291.926895: writeback_single_inode: bdi 8:0: ino=134217993 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3484 index=1
       flush-8:0-3402  [006]  5291.926928: writeback_single_inode: bdi 8:0: ino=134217994 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3483 index=0
       flush-8:0-3402  [006]  5291.926961: writeback_single_inode: bdi 8:0: ino=134217995 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3482 index=0
       flush-8:0-3402  [006]  5291.926996: writeback_single_inode: bdi 8:0: ino=134217996 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3480 index=1
       flush-8:0-3402  [006]  5291.927029: writeback_single_inode: bdi 8:0: ino=134217997 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3478 index=1
       flush-8:0-3402  [006]  5291.927064: writeback_single_inode: bdi 8:0: ino=134217998 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3476 index=1
       flush-8:0-3402  [006]  5291.927096: writeback_single_inode: bdi 8:0: ino=134217999 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3474 index=1
       flush-8:0-3402  [006]  5291.927135: writeback_single_inode: bdi 8:0: ino=134218000 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3472 index=1
       flush-8:0-3402  [006]  5291.927168: writeback_single_inode: bdi 8:0: ino=134218001 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3470 index=1
       flush-8:0-3402  [006]  5291.927205: writeback_single_inode: bdi 8:0: ino=134218002 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3468 index=1
       flush-8:0-3402  [006]  5291.927247: writeback_single_inode: bdi 8:0: ino=134218003 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3460 index=7
       flush-8:0-3402  [006]  5291.927285: writeback_single_inode: bdi 8:0: ino=134218004 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3456 index=3
       flush-8:0-3402  [006]  5291.927320: writeback_single_inode: bdi 8:0: ino=134218005 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3452 index=3
       flush-8:0-3402  [006]  5291.927355: writeback_single_inode: bdi 8:0: ino=134218006 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3450 index=1
       flush-8:0-3402  [006]  5291.927388: writeback_single_inode: bdi 8:0: ino=134218007 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3448 index=1
       flush-8:0-3402  [006]  5291.927421: writeback_single_inode: bdi 8:0: ino=134218008 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3447 index=0
       flush-8:0-3402  [006]  5291.927454: writeback_single_inode: bdi 8:0: ino=134218009 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3445 index=1
       flush-8:0-3402  [006]  5291.927487: writeback_single_inode: bdi 8:0: ino=134218010 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3444 index=0
       flush-8:0-3402  [006]  5291.927518: writeback_single_inode: bdi 8:0: ino=134218011 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3443 index=0
       flush-8:0-3402  [006]  5291.927555: writeback_single_inode: bdi 8:0: ino=134218012 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3441 index=1
       flush-8:0-3402  [006]  5291.927604: writeback_single_inode: bdi 8:0: ino=134218013 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=22 to_write=3419 index=13
       flush-8:0-3402  [006]  5291.927639: writeback_single_inode: bdi 8:0: ino=134218014 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3417 index=1
       flush-8:0-3402  [006]  5291.927681: writeback_single_inode: bdi 8:0: ino=134218015 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3414 index=2
       flush-8:0-3402  [006]  5291.927717: writeback_single_inode: bdi 8:0: ino=134218016 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3411 index=2
       flush-8:0-3402  [006]  5291.927747: writeback_single_inode: bdi 8:0: ino=134218017 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3410 index=0
       flush-8:0-3402  [006]  5291.927782: writeback_single_inode: bdi 8:0: ino=134218018 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3408 index=1
       flush-8:0-3402  [006]  5291.927815: writeback_single_inode: bdi 8:0: ino=134218019 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3406 index=1
       flush-8:0-3402  [006]  5291.927852: writeback_single_inode: bdi 8:0: ino=134218020 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3404 index=1
       flush-8:0-3402  [006]  5291.927885: writeback_single_inode: bdi 8:0: ino=134218021 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3401 index=2
       flush-8:0-3402  [006]  5291.927921: writeback_single_inode: bdi 8:0: ino=134218022 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3398 index=2
       flush-8:0-3402  [006]  5291.927952: writeback_single_inode: bdi 8:0: ino=134218023 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3397 index=0
       flush-8:0-3402  [006]  5291.927986: writeback_single_inode: bdi 8:0: ino=134218024 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3396 index=0
       flush-8:0-3402  [006]  5291.928020: writeback_single_inode: bdi 8:0: ino=134218025 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3393 index=2
       flush-8:0-3402  [006]  5291.928058: writeback_single_inode: bdi 8:0: ino=134218026 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3391 index=1
       flush-8:0-3402  [006]  5291.928093: writeback_single_inode: bdi 8:0: ino=134218027 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3387 index=3
       flush-8:0-3402  [006]  5291.928130: writeback_single_inode: bdi 8:0: ino=134218028 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3385 index=1
       flush-8:0-3402  [006]  5291.928162: writeback_single_inode: bdi 8:0: ino=134218029 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3383 index=1
       flush-8:0-3402  [006]  5291.928197: writeback_single_inode: bdi 8:0: ino=134218030 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3381 index=1
       flush-8:0-3402  [006]  5291.928228: writeback_single_inode: bdi 8:0: ino=134218031 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3380 index=0
       flush-8:0-3402  [006]  5291.928262: writeback_single_inode: bdi 8:0: ino=134218032 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3379 index=0
       flush-8:0-3402  [006]  5291.928294: writeback_single_inode: bdi 8:0: ino=134218033 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3377 index=1
       flush-8:0-3402  [006]  5291.928333: writeback_single_inode: bdi 8:0: ino=134218034 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3375 index=1
       flush-8:0-3402  [006]  5291.928367: writeback_single_inode: bdi 8:0: ino=134218035 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3372 index=2
       flush-8:0-3402  [006]  5291.928408: writeback_single_inode: bdi 8:0: ino=134218036 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3367 index=4
       flush-8:0-3402  [006]  5291.928441: writeback_single_inode: bdi 8:0: ino=134218037 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3365 index=1
       flush-8:0-3402  [006]  5291.928476: writeback_single_inode: bdi 8:0: ino=134218038 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3363 index=1
       flush-8:0-3402  [006]  5291.928507: writeback_single_inode: bdi 8:0: ino=134218039 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3362 index=0
       flush-8:0-3402  [006]  5291.928545: writeback_single_inode: bdi 8:0: ino=134218040 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3360 index=1
       flush-8:0-3402  [006]  5291.928579: writeback_single_inode: bdi 8:0: ino=134218041 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3357 index=2
       flush-8:0-3402  [006]  5291.928613: writeback_single_inode: bdi 8:0: ino=134218042 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3356 index=0
       flush-8:0-3402  [006]  5291.928653: writeback_single_inode: bdi 8:0: ino=134218043 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3353 index=2
       flush-8:0-3402  [006]  5291.928690: writeback_single_inode: bdi 8:0: ino=134218044 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3351 index=1
       flush-8:0-3402  [006]  5291.928724: writeback_single_inode: bdi 8:0: ino=134218045 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3348 index=2
       flush-8:0-3402  [006]  5291.928758: writeback_single_inode: bdi 8:0: ino=134218046 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3347 index=0
       flush-8:0-3402  [006]  5291.928793: writeback_single_inode: bdi 8:0: ino=134218047 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3342 index=4
       flush-8:0-3402  [006]  5291.928826: writeback_single_inode: bdi 8:0: ino=134218048 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3341 index=0
       flush-8:0-3402  [006]  5291.928858: writeback_single_inode: bdi 8:0: ino=134218049 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3340 index=0
       flush-8:0-3402  [006]  5291.928902: writeback_single_inode: bdi 8:0: ino=134218050 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3338 index=1
       flush-8:0-3402  [006]  5291.928934: writeback_single_inode: bdi 8:0: ino=134218051 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3337 index=0
       flush-8:0-3402  [006]  5291.928967: writeback_single_inode: bdi 8:0: ino=134218052 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3336 index=0
       flush-8:0-3402  [006]  5291.929001: writeback_single_inode: bdi 8:0: ino=134218053 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3333 index=2
       flush-8:0-3402  [006]  5291.929039: writeback_single_inode: bdi 8:0: ino=134218054 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3328 index=4
       flush-8:0-3402  [006]  5291.929070: writeback_single_inode: bdi 8:0: ino=134218055 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3327 index=0
       flush-8:0-3402  [006]  5291.929105: writeback_single_inode: bdi 8:0: ino=134218056 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3325 index=1
       flush-8:0-3402  [006]  5291.929137: writeback_single_inode: bdi 8:0: ino=134218057 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3324 index=0
       flush-8:0-3402  [006]  5291.929170: writeback_single_inode: bdi 8:0: ino=134218058 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3323 index=0
       flush-8:0-3402  [006]  5291.929201: writeback_single_inode: bdi 8:0: ino=134218059 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3322 index=0
       flush-8:0-3402  [006]  5291.929279: writeback_single_inode: bdi 8:0: ino=402653351 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=51 to_write=3271 index=13
       flush-8:0-3402  [006]  5291.929314: writeback_single_inode: bdi 8:0: ino=402653352 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3266 index=4
       flush-8:0-3402  [006]  5291.929352: writeback_single_inode: bdi 8:0: ino=402653353 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3262 index=3
       flush-8:0-3402  [006]  5291.929389: writeback_single_inode: bdi 8:0: ino=134218060 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3255 index=6
       flush-8:0-3402  [006]  5291.929430: writeback_single_inode: bdi 8:0: ino=134218061 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3247 index=7
       flush-8:0-3402  [006]  5291.929461: writeback_single_inode: bdi 8:0: ino=134218062 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3246 index=0
       flush-8:0-3402  [006]  5291.929495: writeback_single_inode: bdi 8:0: ino=134218063 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3245 index=0
       flush-8:0-3402  [006]  5291.929526: writeback_single_inode: bdi 8:0: ino=134218064 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3244 index=0
       flush-8:0-3402  [006]  5291.929559: writeback_single_inode: bdi 8:0: ino=134218065 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3243 index=0
       flush-8:0-3402  [006]  5291.929594: writeback_single_inode: bdi 8:0: ino=134218066 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3239 index=3
       flush-8:0-3402  [006]  5291.929629: writeback_single_inode: bdi 8:0: ino=268628487 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3238 index=0
       flush-8:0-3402  [006]  5291.929671: writeback_single_inode: bdi 8:0: ino=268628488 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3234 index=3
       flush-8:0-3402  [006]  5291.929709: writeback_single_inode: bdi 8:0: ino=268628489 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3231 index=2
       flush-8:0-3402  [006]  5291.929744: writeback_single_inode: bdi 8:0: ino=268628490 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3226 index=4
       flush-8:0-3402  [006]  5291.929780: writeback_single_inode: bdi 8:0: ino=268628491 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3225 index=0
       flush-8:0-3402  [006]  5291.929817: writeback_single_inode: bdi 8:0: ino=268628492 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3218 index=6
       flush-8:0-3402  [006]  5291.929853: writeback_single_inode: bdi 8:0: ino=268628493 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3215 index=2
       flush-8:0-3402  [006]  5291.929885: writeback_single_inode: bdi 8:0: ino=402653355 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3214 index=0
       flush-8:0-3402  [006]  5291.929919: writeback_single_inode: bdi 8:0: ino=402653356 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3212 index=1
       flush-8:0-3402  [006]  5291.929956: writeback_single_inode: bdi 8:0: ino=402653357 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3205 index=6
       flush-8:0-3402  [006]  5291.929994: writeback_single_inode: bdi 8:0: ino=402653358 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3203 index=1
       flush-8:0-3402  [006]  5291.930027: writeback_single_inode: bdi 8:0: ino=402653359 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3201 index=1
wfg /tmp%

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
@ 2011-04-21  6:05                   ` Wu Fengguang
  0 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  6:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > I collected the writeback_single_inode() traces (patch attached for
> > your reference), one for each of several test runs, and found many
> > more I_DIRTY_PAGES after the patchset. Dave, do you know why so many
> > I_DIRTY_PAGES flags (or radix tags) remain after the XFS
> > ->writepages() call, even for small files?
> 
> What is your definition of a small file?  As soon as it has multiple
> extents or holes, there's absolutely no way to clean it with a single
> writepage call.

The workload is writing a kernel source tree to XFS. You can see in
the trace below that writeback often leaves more dirty pages behind
(indicated by the I_DIRTY_PAGES flag) even after writing as few as one
page (indicated by the wrote=1 field).
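
For reference, a tracepoint producing lines in this format could be
defined roughly as below. This is a hedged sketch inferred from the
trace output, not the patch attached to this mail; the age units and
the exact field names are assumptions. It would live in
include/trace/events/writeback.h with the usual TRACE_SYSTEM
boilerplate.

#define show_inode_state(state)					\
	__print_flags(state, "|",				\
		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"})

TRACE_EVENT(writeback_single_inode,

	TP_PROTO(struct inode *inode,
		 struct writeback_control *wbc,
		 unsigned long wrote),

	TP_ARGS(inode, wbc, wrote),

	TP_STRUCT__entry(
		__array(char,		name, 32)
		__field(unsigned long,	ino)
		__field(unsigned long,	state)
		__field(unsigned long,	age)
		__field(unsigned long,	wrote)
		__field(long,		nr_to_write)
		__field(unsigned long,	writeback_index)
	),

	TP_fast_assign(
		/* bdi device names look like "8:0" */
		strncpy(__entry->name,
			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
		__entry->ino	= inode->i_ino;
		__entry->state	= inode->i_state;
		/* time since the inode was first dirtied; units assumed */
		__entry->age	= (jiffies - inode->dirtied_when) / HZ;
		__entry->wrote	= wrote;
		__entry->nr_to_write	 = wbc->nr_to_write;
		__entry->writeback_index =
			inode->i_mapping->writeback_index;
	),

	TP_printk("bdi %s: ino=%lu state=%s age=%lu wrote=%lu to_write=%ld index=%lu",
		  __entry->name, __entry->ino,
		  show_inode_state(__entry->state), __entry->age,
		  __entry->wrote, __entry->nr_to_write,
		  __entry->writeback_index)
);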

> Also, XFS tries to operate in as non-blocking a manner as possible
> if the non-blocking flag is set in the wbc, but that flag actually
> seems to be dead these days.

Yeah.

Thanks,
Fengguang
---
wfg /tmp% head -300 trace-dt7-moving-expire-xfs
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
            init-1     [004]  5291.655631: writeback_single_inode: bdi 0:15: ino=1574069 state= age=6 wrote=2 to_write=9223372036854775805 index=179837
            init-1     [004]  5291.657137: writeback_single_inode: bdi 0:15: ino=1574069 state= age=7 wrote=0 to_write=9223372036854775807 index=0
            init-1     [004]  5291.657141: writeback_single_inode: bdi 0:15: ino=1574069 state= age=7 wrote=0 to_write=9223372036854775807 index=0
            init-1     [004]  5291.659716: writeback_single_inode: bdi 0:15: ino=1574069 state= age=3 wrote=1 to_write=9223372036854775806 index=179837
##### CPU 6 buffer started ####
           getty-3417  [006]  5291.661265: writeback_single_inode: bdi 0:15: ino=1574069 state= age=4 wrote=0 to_write=9223372036854775807 index=0
           getty-3417  [006]  5291.661269: writeback_single_inode: bdi 0:15: ino=1574069 state= age=4 wrote=0 to_write=9223372036854775807 index=0
           getty-3417  [006]  5291.663963: writeback_single_inode: bdi 0:15: ino=1574069 state= age=3 wrote=1 to_write=9223372036854775806 index=179837
       flush-8:0-3402  [006]  5291.903857: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_DATASYNC|I_DIRTY_PAGES age=323 wrote=4097 to_write=-1 index=0
       flush-8:0-3402  [006]  5291.919833: writeback_single_inode: bdi 8:0: ino=133 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4095 index=0
       flush-8:0-3402  [006]  5291.919876: writeback_single_inode: bdi 8:0: ino=134 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4093 index=1
       flush-8:0-3402  [006]  5291.919913: writeback_single_inode: bdi 8:0: ino=135 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=4088 index=4
       flush-8:0-3402  [006]  5291.919969: writeback_single_inode: bdi 8:0: ino=136 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=23 to_write=4065 index=13
       flush-8:0-3402  [006]  5291.920008: writeback_single_inode: bdi 8:0: ino=134217857 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4064 index=0
       flush-8:0-3402  [006]  5291.920049: writeback_single_inode: bdi 8:0: ino=134217858 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=4060 index=3
       flush-8:0-3402  [006]  5291.920087: writeback_single_inode: bdi 8:0: ino=268628417 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4059 index=0
       flush-8:0-3402  [006]  5291.920128: writeback_single_inode: bdi 8:0: ino=402653313 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4058 index=0
       flush-8:0-3402  [006]  5291.920160: writeback_single_inode: bdi 8:0: ino=402653314 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4057 index=0
       flush-8:0-3402  [006]  5291.920194: writeback_single_inode: bdi 8:0: ino=402653315 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4056 index=0
       flush-8:0-3402  [006]  5291.920225: writeback_single_inode: bdi 8:0: ino=402653316 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4055 index=0
       flush-8:0-3402  [006]  5291.920260: writeback_single_inode: bdi 8:0: ino=138 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4054 index=0
       flush-8:0-3402  [006]  5291.920291: writeback_single_inode: bdi 8:0: ino=139 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4053 index=0
       flush-8:0-3402  [006]  5291.920325: writeback_single_inode: bdi 8:0: ino=140 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4052 index=0
       flush-8:0-3402  [006]  5291.920356: writeback_single_inode: bdi 8:0: ino=141 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4051 index=0
       flush-8:0-3402  [006]  5291.920393: writeback_single_inode: bdi 8:0: ino=134217860 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4050 index=0
       flush-8:0-3402  [006]  5291.920425: writeback_single_inode: bdi 8:0: ino=134217861 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4049 index=0
       flush-8:0-3402  [006]  5291.920458: writeback_single_inode: bdi 8:0: ino=134217862 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4048 index=0
       flush-8:0-3402  [006]  5291.920489: writeback_single_inode: bdi 8:0: ino=134217863 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4047 index=0
       flush-8:0-3402  [006]  5291.920524: writeback_single_inode: bdi 8:0: ino=134217864 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4045 index=1
       flush-8:0-3402  [006]  5291.920556: writeback_single_inode: bdi 8:0: ino=134217865 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4044 index=0
       flush-8:0-3402  [006]  5291.920589: writeback_single_inode: bdi 8:0: ino=134217866 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4043 index=0
       flush-8:0-3402  [006]  5291.920620: writeback_single_inode: bdi 8:0: ino=134217867 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4042 index=0
       flush-8:0-3402  [006]  5291.920653: writeback_single_inode: bdi 8:0: ino=134217868 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4041 index=0
       flush-8:0-3402  [006]  5291.920718: writeback_single_inode: bdi 8:0: ino=134217869 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4040 index=0
       flush-8:0-3402  [006]  5291.920758: writeback_single_inode: bdi 8:0: ino=268628419 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4039 index=0
       flush-8:0-3402  [006]  5291.920790: writeback_single_inode: bdi 8:0: ino=268628420 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4038 index=0
       flush-8:0-3402  [006]  5291.920823: writeback_single_inode: bdi 8:0: ino=268628421 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4037 index=0
       flush-8:0-3402  [006]  5291.920855: writeback_single_inode: bdi 8:0: ino=268628422 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4036 index=0
       flush-8:0-3402  [006]  5291.920890: writeback_single_inode: bdi 8:0: ino=268628423 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4035 index=0
       flush-8:0-3402  [006]  5291.920924: writeback_single_inode: bdi 8:0: ino=268628424 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4033 index=1
       flush-8:0-3402  [006]  5291.920957: writeback_single_inode: bdi 8:0: ino=268628425 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4032 index=0
       flush-8:0-3402  [006]  5291.920988: writeback_single_inode: bdi 8:0: ino=268628426 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4031 index=0
       flush-8:0-3402  [006]  5291.921021: writeback_single_inode: bdi 8:0: ino=268628427 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4030 index=0
       flush-8:0-3402  [006]  5291.921054: writeback_single_inode: bdi 8:0: ino=268628428 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4028 index=1
       flush-8:0-3402  [006]  5291.921091: writeback_single_inode: bdi 8:0: ino=268628429 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4027 index=0
       flush-8:0-3402  [006]  5291.921122: writeback_single_inode: bdi 8:0: ino=268628430 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4026 index=0
       flush-8:0-3402  [006]  5291.921155: writeback_single_inode: bdi 8:0: ino=268628431 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4025 index=0
       flush-8:0-3402  [006]  5291.921188: writeback_single_inode: bdi 8:0: ino=268628432 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4023 index=1
       flush-8:0-3402  [006]  5291.921224: writeback_single_inode: bdi 8:0: ino=268628433 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4022 index=0
       flush-8:0-3402  [006]  5291.921256: writeback_single_inode: bdi 8:0: ino=268628434 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4021 index=0
       flush-8:0-3402  [006]  5291.921289: writeback_single_inode: bdi 8:0: ino=268628435 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4020 index=0
       flush-8:0-3402  [006]  5291.921320: writeback_single_inode: bdi 8:0: ino=268628436 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4019 index=0
       flush-8:0-3402  [006]  5291.921354: writeback_single_inode: bdi 8:0: ino=268628437 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4018 index=0
       flush-8:0-3402  [006]  5291.921385: writeback_single_inode: bdi 8:0: ino=268628438 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4017 index=0
       flush-8:0-3402  [006]  5291.921421: writeback_single_inode: bdi 8:0: ino=268628439 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4016 index=0
       flush-8:0-3402  [006]  5291.921453: writeback_single_inode: bdi 8:0: ino=268628440 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4015 index=0
       flush-8:0-3402  [006]  5291.921487: writeback_single_inode: bdi 8:0: ino=268628441 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4014 index=0
       flush-8:0-3402  [006]  5291.921518: writeback_single_inode: bdi 8:0: ino=268628442 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4013 index=0
       flush-8:0-3402  [006]  5291.921552: writeback_single_inode: bdi 8:0: ino=268628443 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4012 index=0
       flush-8:0-3402  [006]  5291.921586: writeback_single_inode: bdi 8:0: ino=268628444 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=4009 index=2
       flush-8:0-3402  [006]  5291.921622: writeback_single_inode: bdi 8:0: ino=268628445 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4007 index=1
       flush-8:0-3402  [006]  5291.921653: writeback_single_inode: bdi 8:0: ino=268628446 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4006 index=0
       flush-8:0-3402  [006]  5291.921709: writeback_single_inode: bdi 8:0: ino=268628447 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4005 index=0
       flush-8:0-3402  [006]  5291.921742: writeback_single_inode: bdi 8:0: ino=268628448 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4004 index=0
       flush-8:0-3402  [006]  5291.921775: writeback_single_inode: bdi 8:0: ino=268628449 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4003 index=0
       flush-8:0-3402  [006]  5291.921807: writeback_single_inode: bdi 8:0: ino=268628450 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4002 index=0
       flush-8:0-3402  [006]  5291.921840: writeback_single_inode: bdi 8:0: ino=268628451 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4001 index=0
       flush-8:0-3402  [006]  5291.921874: writeback_single_inode: bdi 8:0: ino=268628452 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3999 index=1
       flush-8:0-3402  [006]  5291.921909: writeback_single_inode: bdi 8:0: ino=268628453 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3997 index=1
       flush-8:0-3402  [006]  5291.921940: writeback_single_inode: bdi 8:0: ino=268628454 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3996 index=0
       flush-8:0-3402  [006]  5291.921974: writeback_single_inode: bdi 8:0: ino=268628455 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3995 index=0
       flush-8:0-3402  [006]  5291.922005: writeback_single_inode: bdi 8:0: ino=268628456 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3994 index=0
       flush-8:0-3402  [006]  5291.922044: writeback_single_inode: bdi 8:0: ino=268628457 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3992 index=1
       flush-8:0-3402  [006]  5291.922077: writeback_single_inode: bdi 8:0: ino=268628458 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3990 index=1
       flush-8:0-3402  [006]  5291.922116: writeback_single_inode: bdi 8:0: ino=268628459 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3988 index=1
       flush-8:0-3402  [006]  5291.922149: writeback_single_inode: bdi 8:0: ino=268628460 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3986 index=1
       flush-8:0-3402  [006]  5291.922182: writeback_single_inode: bdi 8:0: ino=268628461 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3985 index=0
       flush-8:0-3402  [006]  5291.922213: writeback_single_inode: bdi 8:0: ino=268628462 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3984 index=0
       flush-8:0-3402  [006]  5291.922246: writeback_single_inode: bdi 8:0: ino=268628463 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3983 index=0
       flush-8:0-3402  [006]  5291.922277: writeback_single_inode: bdi 8:0: ino=268628464 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3982 index=0
       flush-8:0-3402  [006]  5291.922310: writeback_single_inode: bdi 8:0: ino=268628465 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3981 index=0
       flush-8:0-3402  [006]  5291.922341: writeback_single_inode: bdi 8:0: ino=268628466 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3980 index=0
       flush-8:0-3402  [006]  5291.922375: writeback_single_inode: bdi 8:0: ino=268628467 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3979 index=0
       flush-8:0-3402  [006]  5291.922406: writeback_single_inode: bdi 8:0: ino=268628468 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3978 index=0
       flush-8:0-3402  [006]  5291.922439: writeback_single_inode: bdi 8:0: ino=268628469 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3977 index=0
       flush-8:0-3402  [006]  5291.922474: writeback_single_inode: bdi 8:0: ino=268628470 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3972 index=4
       flush-8:0-3402  [006]  5291.922508: writeback_single_inode: bdi 8:0: ino=268628471 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3971 index=0
       flush-8:0-3402  [006]  5291.922539: writeback_single_inode: bdi 8:0: ino=268628472 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3970 index=0
       flush-8:0-3402  [006]  5291.922572: writeback_single_inode: bdi 8:0: ino=268628473 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3969 index=0
       flush-8:0-3402  [006]  5291.922603: writeback_single_inode: bdi 8:0: ino=268628474 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3968 index=0
       flush-8:0-3402  [006]  5291.922636: writeback_single_inode: bdi 8:0: ino=268628475 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3967 index=0
       flush-8:0-3402  [006]  5291.922673: writeback_single_inode: bdi 8:0: ino=268628476 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3966 index=0
       flush-8:0-3402  [006]  5291.922709: writeback_single_inode: bdi 8:0: ino=268628477 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3965 index=0
       flush-8:0-3402  [006]  5291.922741: writeback_single_inode: bdi 8:0: ino=268628478 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3964 index=0
       flush-8:0-3402  [006]  5291.922777: writeback_single_inode: bdi 8:0: ino=268628479 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3963 index=0
       flush-8:0-3402  [006]  5291.922810: writeback_single_inode: bdi 8:0: ino=268628480 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3961 index=1
       flush-8:0-3402  [006]  5291.922850: writeback_single_inode: bdi 8:0: ino=268628481 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3960 index=0
       flush-8:0-3402  [006]  5291.922882: writeback_single_inode: bdi 8:0: ino=268628482 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3959 index=0
       flush-8:0-3402  [006]  5291.922915: writeback_single_inode: bdi 8:0: ino=268628483 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3958 index=0
       flush-8:0-3402  [006]  5291.922946: writeback_single_inode: bdi 8:0: ino=268628484 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3957 index=0
       flush-8:0-3402  [006]  5291.922980: writeback_single_inode: bdi 8:0: ino=268628485 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3956 index=0
       flush-8:0-3402  [006]  5291.923015: writeback_single_inode: bdi 8:0: ino=134217870 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3953 index=2
       flush-8:0-3402  [006]  5291.923052: writeback_single_inode: bdi 8:0: ino=134217871 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3949 index=3
       flush-8:0-3402  [006]  5291.923090: writeback_single_inode: bdi 8:0: ino=134217872 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3941 index=7
       flush-8:0-3402  [006]  5291.923129: writeback_single_inode: bdi 8:0: ino=134217873 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3934 index=6
       flush-8:0-3402  [006]  5291.923167: writeback_single_inode: bdi 8:0: ino=134217874 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3927 index=6
       flush-8:0-3402  [006]  5291.923202: writeback_single_inode: bdi 8:0: ino=134217875 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3925 index=1
       flush-8:0-3402  [006]  5291.923234: writeback_single_inode: bdi 8:0: ino=134217876 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3924 index=0
       flush-8:0-3402  [006]  5291.923268: writeback_single_inode: bdi 8:0: ino=402653318 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3923 index=0
       flush-8:0-3402  [006]  5291.923305: writeback_single_inode: bdi 8:0: ino=402653319 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=6 to_write=3917 index=5
       flush-8:0-3402  [006]  5291.923341: writeback_single_inode: bdi 8:0: ino=402653320 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3915 index=1
       flush-8:0-3402  [006]  5291.923372: writeback_single_inode: bdi 8:0: ino=402653321 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3914 index=0
       flush-8:0-3402  [006]  5291.923410: writeback_single_inode: bdi 8:0: ino=402653322 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3910 index=3
       flush-8:0-3402  [006]  5291.923444: writeback_single_inode: bdi 8:0: ino=402653323 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3906 index=3
       flush-8:0-3402  [006]  5291.923483: writeback_single_inode: bdi 8:0: ino=402653324 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3903 index=2
       flush-8:0-3402  [006]  5291.923521: writeback_single_inode: bdi 8:0: ino=402653325 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3895 index=7
       flush-8:0-3402  [006]  5291.923556: writeback_single_inode: bdi 8:0: ino=143 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3894 index=0
       flush-8:0-3402  [006]  5291.923595: writeback_single_inode: bdi 8:0: ino=144 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=10 to_write=3884 index=9
       flush-8:0-3402  [006]  5291.923630: writeback_single_inode: bdi 8:0: ino=145 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3882 index=1
       flush-8:0-3402  [006]  5291.923673: writeback_single_inode: bdi 8:0: ino=146 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3875 index=6
       flush-8:0-3402  [006]  5291.923711: writeback_single_inode: bdi 8:0: ino=147 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3874 index=0
       flush-8:0-3402  [006]  5291.923746: writeback_single_inode: bdi 8:0: ino=148 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3870 index=3
       flush-8:0-3402  [006]  5291.923780: writeback_single_inode: bdi 8:0: ino=149 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3869 index=0
       flush-8:0-3402  [006]  5291.923817: writeback_single_inode: bdi 8:0: ino=150 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=6 to_write=3863 index=5
       flush-8:0-3402  [006]  5291.923852: writeback_single_inode: bdi 8:0: ino=151 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3860 index=2
       flush-8:0-3402  [006]  5291.923887: writeback_single_inode: bdi 8:0: ino=152 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3856 index=3
       flush-8:0-3402  [006]  5291.923931: writeback_single_inode: bdi 8:0: ino=153 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=13 to_write=3843 index=12
       flush-8:0-3402  [006]  5291.923964: writeback_single_inode: bdi 8:0: ino=154 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3841 index=1
       flush-8:0-3402  [006]  5291.924014: writeback_single_inode: bdi 8:0: ino=155 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=18 to_write=3823 index=13
       flush-8:0-3402  [006]  5291.924045: writeback_single_inode: bdi 8:0: ino=156 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3822 index=0
       flush-8:0-3402  [006]  5291.924092: writeback_single_inode: bdi 8:0: ino=157 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=13 to_write=3809 index=12
       flush-8:0-3402  [006]  5291.924127: writeback_single_inode: bdi 8:0: ino=402653326 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3805 index=3
       flush-8:0-3402  [006]  5291.924167: writeback_single_inode: bdi 8:0: ino=402653327 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3797 index=7
       flush-8:0-3402  [006]  5291.924203: writeback_single_inode: bdi 8:0: ino=402653328 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3792 index=4
       flush-8:0-3402  [006]  5291.924242: writeback_single_inode: bdi 8:0: ino=402653329 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3789 index=2
       flush-8:0-3402  [006]  5291.924282: writeback_single_inode: bdi 8:0: ino=402653330 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3778 index=10
       flush-8:0-3402  [006]  5291.924330: writeback_single_inode: bdi 8:0: ino=402653331 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=17 to_write=3761 index=13
       flush-8:0-3402  [006]  5291.924370: writeback_single_inode: bdi 8:0: ino=402653332 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3750 index=10
       flush-8:0-3402  [006]  5291.924413: writeback_single_inode: bdi 8:0: ino=402653333 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3738 index=11
       flush-8:0-3402  [006]  5291.924446: writeback_single_inode: bdi 8:0: ino=402653334 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3735 index=2
       flush-8:0-3402  [006]  5291.924483: writeback_single_inode: bdi 8:0: ino=402653335 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3731 index=3
       flush-8:0-3402  [006]  5291.924513: writeback_single_inode: bdi 8:0: ino=402653336 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3730 index=0
       flush-8:0-3402  [006]  5291.924554: writeback_single_inode: bdi 8:0: ino=402653337 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3722 index=7
       flush-8:0-3402  [006]  5291.924588: writeback_single_inode: bdi 8:0: ino=402653338 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3719 index=2
       flush-8:0-3402  [006]  5291.924626: writeback_single_inode: bdi 8:0: ino=402653339 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3717 index=1
       flush-8:0-3402  [006]  5291.924679: writeback_single_inode: bdi 8:0: ino=402653340 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3705 index=11
       flush-8:0-3402  [006]  5291.924719: writeback_single_inode: bdi 8:0: ino=402653341 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3704 index=0
       flush-8:0-3402  [006]  5291.924751: writeback_single_inode: bdi 8:0: ino=402653342 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3702 index=1
       flush-8:0-3402  [006]  5291.924787: writeback_single_inode: bdi 8:0: ino=402653343 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3699 index=2
       flush-8:0-3402  [006]  5291.924820: writeback_single_inode: bdi 8:0: ino=402653344 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3697 index=1
       flush-8:0-3402  [006]  5291.924856: writeback_single_inode: bdi 8:0: ino=402653345 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3693 index=3
       flush-8:0-3402  [006]  5291.924888: writeback_single_inode: bdi 8:0: ino=402653346 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3692 index=0
       flush-8:0-3402  [006]  5291.924921: writeback_single_inode: bdi 8:0: ino=402653347 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3691 index=0
       flush-8:0-3402  [006]  5291.924952: writeback_single_inode: bdi 8:0: ino=402653348 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3690 index=0
       flush-8:0-3402  [006]  5291.924995: writeback_single_inode: bdi 8:0: ino=402653349 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=9 to_write=3681 index=8
       flush-8:0-3402  [006]  5291.925035: writeback_single_inode: bdi 8:0: ino=402653350 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=10 to_write=3671 index=9
       flush-8:0-3402  [006]  5291.925070: writeback_single_inode: bdi 8:0: ino=134217878 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3670 index=0
       flush-8:0-3402  [006]  5291.925103: writeback_single_inode: bdi 8:0: ino=134217879 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3668 index=1
       flush-8:0-3402  [006]  5291.925140: writeback_single_inode: bdi 8:0: ino=134217880 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3663 index=4
       flush-8:0-3402  [006]  5291.925181: writeback_single_inode: bdi 8:0: ino=134217881 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3651 index=11
       flush-8:0-3402  [006]  5291.925235: writeback_single_inode: bdi 8:0: ino=134217882 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=24 to_write=3627 index=13
       flush-8:0-3402  [006]  5291.925283: writeback_single_inode: bdi 8:0: ino=134217883 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=20 to_write=3607 index=13
       flush-8:0-3402  [006]  5291.925319: writeback_single_inode: bdi 8:0: ino=134217884 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3605 index=1
       flush-8:0-3402  [006]  5291.925351: writeback_single_inode: bdi 8:0: ino=134217885 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3603 index=1
       flush-8:0-3402  [006]  5291.925386: writeback_single_inode: bdi 8:0: ino=134217886 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3601 index=1
       flush-8:0-3402  [006]  5291.925417: writeback_single_inode: bdi 8:0: ino=134217887 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3600 index=0
       flush-8:0-3402  [006]  5291.925450: writeback_single_inode: bdi 8:0: ino=134217888 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3599 index=0
       flush-8:0-3402  [006]  5291.925481: writeback_single_inode: bdi 8:0: ino=134217889 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3598 index=0
       flush-8:0-3402  [006]  5291.925519: writeback_single_inode: bdi 8:0: ino=134217890 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3596 index=1
       flush-8:0-3402  [006]  5291.925552: writeback_single_inode: bdi 8:0: ino=134217891 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3594 index=1
       flush-8:0-3402  [006]  5291.925594: writeback_single_inode: bdi 8:0: ino=134217892 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3589 index=4
       flush-8:0-3402  [006]  5291.925626: writeback_single_inode: bdi 8:0: ino=134217893 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3588 index=0
       flush-8:0-3402  [006]  5291.925669: writeback_single_inode: bdi 8:0: ino=134217894 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3584 index=3
       flush-8:0-3402  [006]  5291.925703: writeback_single_inode: bdi 8:0: ino=134217895 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3582 index=1
       flush-8:0-3402  [006]  5291.925746: writeback_single_inode: bdi 8:0: ino=134217896 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3574 index=7
       flush-8:0-3402  [006]  5291.925777: writeback_single_inode: bdi 8:0: ino=134217897 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3573 index=0
       flush-8:0-3402  [006]  5291.925813: writeback_single_inode: bdi 8:0: ino=134217898 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3571 index=1
       flush-8:0-3402  [006]  5291.925850: writeback_single_inode: bdi 8:0: ino=134217899 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3564 index=6
       flush-8:0-3402  [006]  5291.925891: writeback_single_inode: bdi 8:0: ino=134217900 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3557 index=6
       flush-8:0-3402  [006]  5291.925925: writeback_single_inode: bdi 8:0: ino=134217901 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3554 index=2
       flush-8:0-3402  [006]  5291.925965: writeback_single_inode: bdi 8:0: ino=134217902 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3547 index=6
       flush-8:0-3402  [006]  5291.925999: writeback_single_inode: bdi 8:0: ino=134217903 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3544 index=2
       flush-8:0-3402  [006]  5291.926033: writeback_single_inode: bdi 8:0: ino=134217904 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3543 index=0
       flush-8:0-3402  [006]  5291.926065: writeback_single_inode: bdi 8:0: ino=134217905 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3541 index=1
       flush-8:0-3402  [006]  5291.926100: writeback_single_inode: bdi 8:0: ino=134217906 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3539 index=1
       flush-8:0-3402  [006]  5291.926131: writeback_single_inode: bdi 8:0: ino=134217907 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3538 index=0
       flush-8:0-3402  [006]  5291.926164: writeback_single_inode: bdi 8:0: ino=134217908 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3537 index=0
       flush-8:0-3402  [006]  5291.926197: writeback_single_inode: bdi 8:0: ino=134217909 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3535 index=1
       flush-8:0-3402  [006]  5291.926232: writeback_single_inode: bdi 8:0: ino=134217910 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3533 index=1
       flush-8:0-3402  [006]  5291.926264: writeback_single_inode: bdi 8:0: ino=134217911 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3531 index=1
       flush-8:0-3402  [006]  5291.926298: writeback_single_inode: bdi 8:0: ino=134217912 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3530 index=0
       flush-8:0-3402  [006]  5291.926338: writeback_single_inode: bdi 8:0: ino=134217913 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3519 index=10
       flush-8:0-3402  [006]  5291.926376: writeback_single_inode: bdi 8:0: ino=134217914 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3517 index=1
       flush-8:0-3402  [006]  5291.926411: writeback_single_inode: bdi 8:0: ino=134217915 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3514 index=2
       flush-8:0-3402  [006]  5291.926450: writeback_single_inode: bdi 8:0: ino=134217916 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3511 index=2
       flush-8:0-3402  [006]  5291.926482: writeback_single_inode: bdi 8:0: ino=134217917 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3510 index=0
       flush-8:0-3402  [006]  5291.926516: writeback_single_inode: bdi 8:0: ino=134217918 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3508 index=1
       flush-8:0-3402  [006]  5291.926549: writeback_single_inode: bdi 8:0: ino=134217919 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3506 index=1
       flush-8:0-3402  [006]  5291.926594: writeback_single_inode: bdi 8:0: ino=134217984 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=9 to_write=3497 index=8
       flush-8:0-3402  [006]  5291.926627: writeback_single_inode: bdi 8:0: ino=134217985 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3494 index=2
       flush-8:0-3402  [006]  5291.926667: writeback_single_inode: bdi 8:0: ino=134217986 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3493 index=0
       flush-8:0-3402  [006]  5291.926699: writeback_single_inode: bdi 8:0: ino=134217987 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3492 index=0
       flush-8:0-3402  [006]  5291.926732: writeback_single_inode: bdi 8:0: ino=134217988 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3491 index=0
       flush-8:0-3402  [006]  5291.926763: writeback_single_inode: bdi 8:0: ino=134217989 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3490 index=0
       flush-8:0-3402  [006]  5291.926796: writeback_single_inode: bdi 8:0: ino=134217990 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3489 index=0
       flush-8:0-3402  [006]  5291.926827: writeback_single_inode: bdi 8:0: ino=134217991 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3488 index=0
       flush-8:0-3402  [006]  5291.926862: writeback_single_inode: bdi 8:0: ino=134217992 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3486 index=1
       flush-8:0-3402  [006]  5291.926895: writeback_single_inode: bdi 8:0: ino=134217993 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3484 index=1
       flush-8:0-3402  [006]  5291.926928: writeback_single_inode: bdi 8:0: ino=134217994 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3483 index=0
       flush-8:0-3402  [006]  5291.926961: writeback_single_inode: bdi 8:0: ino=134217995 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3482 index=0
       flush-8:0-3402  [006]  5291.926996: writeback_single_inode: bdi 8:0: ino=134217996 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3480 index=1
       flush-8:0-3402  [006]  5291.927029: writeback_single_inode: bdi 8:0: ino=134217997 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3478 index=1
       flush-8:0-3402  [006]  5291.927064: writeback_single_inode: bdi 8:0: ino=134217998 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3476 index=1
       flush-8:0-3402  [006]  5291.927096: writeback_single_inode: bdi 8:0: ino=134217999 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3474 index=1
       flush-8:0-3402  [006]  5291.927135: writeback_single_inode: bdi 8:0: ino=134218000 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3472 index=1
       flush-8:0-3402  [006]  5291.927168: writeback_single_inode: bdi 8:0: ino=134218001 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3470 index=1
       flush-8:0-3402  [006]  5291.927205: writeback_single_inode: bdi 8:0: ino=134218002 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3468 index=1
       flush-8:0-3402  [006]  5291.927247: writeback_single_inode: bdi 8:0: ino=134218003 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3460 index=7
       flush-8:0-3402  [006]  5291.927285: writeback_single_inode: bdi 8:0: ino=134218004 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3456 index=3
       flush-8:0-3402  [006]  5291.927320: writeback_single_inode: bdi 8:0: ino=134218005 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3452 index=3
       flush-8:0-3402  [006]  5291.927355: writeback_single_inode: bdi 8:0: ino=134218006 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3450 index=1
       flush-8:0-3402  [006]  5291.927388: writeback_single_inode: bdi 8:0: ino=134218007 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3448 index=1
       flush-8:0-3402  [006]  5291.927421: writeback_single_inode: bdi 8:0: ino=134218008 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3447 index=0
       flush-8:0-3402  [006]  5291.927454: writeback_single_inode: bdi 8:0: ino=134218009 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3445 index=1
       flush-8:0-3402  [006]  5291.927487: writeback_single_inode: bdi 8:0: ino=134218010 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3444 index=0
       flush-8:0-3402  [006]  5291.927518: writeback_single_inode: bdi 8:0: ino=134218011 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3443 index=0
       flush-8:0-3402  [006]  5291.927555: writeback_single_inode: bdi 8:0: ino=134218012 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3441 index=1
       flush-8:0-3402  [006]  5291.927604: writeback_single_inode: bdi 8:0: ino=134218013 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=22 to_write=3419 index=13
       flush-8:0-3402  [006]  5291.927639: writeback_single_inode: bdi 8:0: ino=134218014 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3417 index=1
       flush-8:0-3402  [006]  5291.927681: writeback_single_inode: bdi 8:0: ino=134218015 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3414 index=2
       flush-8:0-3402  [006]  5291.927717: writeback_single_inode: bdi 8:0: ino=134218016 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3411 index=2
       flush-8:0-3402  [006]  5291.927747: writeback_single_inode: bdi 8:0: ino=134218017 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3410 index=0
       flush-8:0-3402  [006]  5291.927782: writeback_single_inode: bdi 8:0: ino=134218018 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3408 index=1
       flush-8:0-3402  [006]  5291.927815: writeback_single_inode: bdi 8:0: ino=134218019 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3406 index=1
       flush-8:0-3402  [006]  5291.927852: writeback_single_inode: bdi 8:0: ino=134218020 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3404 index=1
       flush-8:0-3402  [006]  5291.927885: writeback_single_inode: bdi 8:0: ino=134218021 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3401 index=2
       flush-8:0-3402  [006]  5291.927921: writeback_single_inode: bdi 8:0: ino=134218022 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3398 index=2
       flush-8:0-3402  [006]  5291.927952: writeback_single_inode: bdi 8:0: ino=134218023 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3397 index=0
       flush-8:0-3402  [006]  5291.927986: writeback_single_inode: bdi 8:0: ino=134218024 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3396 index=0
       flush-8:0-3402  [006]  5291.928020: writeback_single_inode: bdi 8:0: ino=134218025 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3393 index=2
       flush-8:0-3402  [006]  5291.928058: writeback_single_inode: bdi 8:0: ino=134218026 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3391 index=1
       flush-8:0-3402  [006]  5291.928093: writeback_single_inode: bdi 8:0: ino=134218027 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3387 index=3
       flush-8:0-3402  [006]  5291.928130: writeback_single_inode: bdi 8:0: ino=134218028 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3385 index=1
       flush-8:0-3402  [006]  5291.928162: writeback_single_inode: bdi 8:0: ino=134218029 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3383 index=1
       flush-8:0-3402  [006]  5291.928197: writeback_single_inode: bdi 8:0: ino=134218030 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3381 index=1
       flush-8:0-3402  [006]  5291.928228: writeback_single_inode: bdi 8:0: ino=134218031 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3380 index=0
       flush-8:0-3402  [006]  5291.928262: writeback_single_inode: bdi 8:0: ino=134218032 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3379 index=0
       flush-8:0-3402  [006]  5291.928294: writeback_single_inode: bdi 8:0: ino=134218033 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3377 index=1
       flush-8:0-3402  [006]  5291.928333: writeback_single_inode: bdi 8:0: ino=134218034 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3375 index=1
       flush-8:0-3402  [006]  5291.928367: writeback_single_inode: bdi 8:0: ino=134218035 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3372 index=2
       flush-8:0-3402  [006]  5291.928408: writeback_single_inode: bdi 8:0: ino=134218036 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3367 index=4
       flush-8:0-3402  [006]  5291.928441: writeback_single_inode: bdi 8:0: ino=134218037 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3365 index=1
       flush-8:0-3402  [006]  5291.928476: writeback_single_inode: bdi 8:0: ino=134218038 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3363 index=1
       flush-8:0-3402  [006]  5291.928507: writeback_single_inode: bdi 8:0: ino=134218039 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3362 index=0
       flush-8:0-3402  [006]  5291.928545: writeback_single_inode: bdi 8:0: ino=134218040 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3360 index=1
       flush-8:0-3402  [006]  5291.928579: writeback_single_inode: bdi 8:0: ino=134218041 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3357 index=2
       flush-8:0-3402  [006]  5291.928613: writeback_single_inode: bdi 8:0: ino=134218042 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3356 index=0
       flush-8:0-3402  [006]  5291.928653: writeback_single_inode: bdi 8:0: ino=134218043 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3353 index=2
       flush-8:0-3402  [006]  5291.928690: writeback_single_inode: bdi 8:0: ino=134218044 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3351 index=1
       flush-8:0-3402  [006]  5291.928724: writeback_single_inode: bdi 8:0: ino=134218045 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3348 index=2
       flush-8:0-3402  [006]  5291.928758: writeback_single_inode: bdi 8:0: ino=134218046 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3347 index=0
       flush-8:0-3402  [006]  5291.928793: writeback_single_inode: bdi 8:0: ino=134218047 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3342 index=4
       flush-8:0-3402  [006]  5291.928826: writeback_single_inode: bdi 8:0: ino=134218048 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3341 index=0
       flush-8:0-3402  [006]  5291.928858: writeback_single_inode: bdi 8:0: ino=134218049 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3340 index=0
       flush-8:0-3402  [006]  5291.928902: writeback_single_inode: bdi 8:0: ino=134218050 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3338 index=1
       flush-8:0-3402  [006]  5291.928934: writeback_single_inode: bdi 8:0: ino=134218051 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3337 index=0
       flush-8:0-3402  [006]  5291.928967: writeback_single_inode: bdi 8:0: ino=134218052 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3336 index=0
       flush-8:0-3402  [006]  5291.929001: writeback_single_inode: bdi 8:0: ino=134218053 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3333 index=2
       flush-8:0-3402  [006]  5291.929039: writeback_single_inode: bdi 8:0: ino=134218054 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3328 index=4
       flush-8:0-3402  [006]  5291.929070: writeback_single_inode: bdi 8:0: ino=134218055 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3327 index=0
       flush-8:0-3402  [006]  5291.929105: writeback_single_inode: bdi 8:0: ino=134218056 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3325 index=1
       flush-8:0-3402  [006]  5291.929137: writeback_single_inode: bdi 8:0: ino=134218057 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3324 index=0
       flush-8:0-3402  [006]  5291.929170: writeback_single_inode: bdi 8:0: ino=134218058 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3323 index=0
       flush-8:0-3402  [006]  5291.929201: writeback_single_inode: bdi 8:0: ino=134218059 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3322 index=0
       flush-8:0-3402  [006]  5291.929279: writeback_single_inode: bdi 8:0: ino=402653351 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=51 to_write=3271 index=13
       flush-8:0-3402  [006]  5291.929314: writeback_single_inode: bdi 8:0: ino=402653352 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3266 index=4
       flush-8:0-3402  [006]  5291.929352: writeback_single_inode: bdi 8:0: ino=402653353 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3262 index=3
       flush-8:0-3402  [006]  5291.929389: writeback_single_inode: bdi 8:0: ino=134218060 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3255 index=6
       flush-8:0-3402  [006]  5291.929430: writeback_single_inode: bdi 8:0: ino=134218061 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3247 index=7
       flush-8:0-3402  [006]  5291.929461: writeback_single_inode: bdi 8:0: ino=134218062 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3246 index=0
       flush-8:0-3402  [006]  5291.929495: writeback_single_inode: bdi 8:0: ino=134218063 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3245 index=0
       flush-8:0-3402  [006]  5291.929526: writeback_single_inode: bdi 8:0: ino=134218064 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3244 index=0
       flush-8:0-3402  [006]  5291.929559: writeback_single_inode: bdi 8:0: ino=134218065 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3243 index=0
       flush-8:0-3402  [006]  5291.929594: writeback_single_inode: bdi 8:0: ino=134218066 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3239 index=3
       flush-8:0-3402  [006]  5291.929629: writeback_single_inode: bdi 8:0: ino=268628487 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3238 index=0
       flush-8:0-3402  [006]  5291.929671: writeback_single_inode: bdi 8:0: ino=268628488 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3234 index=3
       flush-8:0-3402  [006]  5291.929709: writeback_single_inode: bdi 8:0: ino=268628489 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3231 index=2
       flush-8:0-3402  [006]  5291.929744: writeback_single_inode: bdi 8:0: ino=268628490 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3226 index=4
       flush-8:0-3402  [006]  5291.929780: writeback_single_inode: bdi 8:0: ino=268628491 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3225 index=0
       flush-8:0-3402  [006]  5291.929817: writeback_single_inode: bdi 8:0: ino=268628492 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3218 index=6
       flush-8:0-3402  [006]  5291.929853: writeback_single_inode: bdi 8:0: ino=268628493 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3215 index=2
       flush-8:0-3402  [006]  5291.929885: writeback_single_inode: bdi 8:0: ino=402653355 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3214 index=0
       flush-8:0-3402  [006]  5291.929919: writeback_single_inode: bdi 8:0: ino=402653356 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3212 index=1
       flush-8:0-3402  [006]  5291.929956: writeback_single_inode: bdi 8:0: ino=402653357 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3205 index=6
       flush-8:0-3402  [006]  5291.929994: writeback_single_inode: bdi 8:0: ino=402653358 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3203 index=1
       flush-8:0-3402  [006]  5291.930027: writeback_single_inode: bdi 8:0: ino=402653359 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3201 index=1
wfg /tmp%

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 0/6] writeback: moving expire targets for background/kupdate works
  2011-04-21  5:56       ` Christoph Hellwig
@ 2011-04-21  6:07         ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21  6:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 01:56:34PM +0800, Christoph Hellwig wrote:
> On Thu, Apr 21, 2011 at 01:50:31PM +0800, Wu Fengguang wrote:
> > Hi Christoph,
> > 
> > On Thu, Apr 21, 2011 at 12:34:50PM +0800, Christoph Hellwig wrote:
> > > Hi Wu,
> > > 
> > > if you're queueing up writeback changes can you look into splitting
> > > inode_wb_list_lock as it was done in earlier versions of the inode
> > > scalability patches?  Especially if we don't get the I/O-less
> > > balance_dirty_pages in ASAP, it'll at least allow us to scale the
> > > busy waiting for the list manipulations to one CPU per BDI.
> > 
> > Do you mean to split inode_wb_list_lock into struct bdi_writeback? 
> > So as to improve at least the JBOD case now and hopefully benefit the
> > 1-bdi case when switching to multiple bdi_writeback per bdi in future?
> > 
> > I've not touched any locking code before, but it looks like some dumb
> > code replacement. Let me try it :)
> 
> I can do the patch if you want, it would be useful to carry it in your
> series to avoid conflicts, though.

I see. I'll do it, thanks!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21  4:10                       ` Wu Fengguang
@ 2011-04-21  6:36                         ` Dave Chinner
  -1 siblings, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-21  6:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Thu, Apr 21, 2011 at 12:10:11PM +0800, Wu Fengguang wrote:
> > > Still, given wb_writeback() is the only caller of both
> > > __writeback_inodes_sb and writeback_inodes_wb(), I'm wondering if
> > > moving the queue_io calls up into wb_writeback() would clean up this
> > > logic somewhat. I think Jan mentioned doing something like this as
> > > well elsewhere in the thread...
> > 
> > Unfortunately they call queue_io() inside the lock..
> 
> OK, let's try moving up the lock too. Do you like this change? :)

Yes, very much ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
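
For reference, a minimal sketch of the change being agreed on here --
hoisting queue_io() and wb->list_lock out of writeback_inodes_wb() /
__writeback_inodes_sb() into the wb_writeback() loop.  Names follow
the patches in this thread; illustrative only, not the actual patch:

	/*
	 * Sketch: wb_writeback() owns wb->list_lock and refills b_io
	 * itself once per iteration; the callees below are assumed to
	 * be changed to run under the caller's lock and to no longer
	 * call queue_io() internally.
	 */
	spin_lock(&wb->list_lock);
	if (list_empty(&wb->b_io))
		queue_io(wb, wbc.older_than_this);
	if (work->sb)
		writeback_sb_inodes(work->sb, wb, &wbc, true);
	else
		writeback_inodes_wb(wb, &wbc);
	spin_unlock(&wb->list_lock);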

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  3:33             ` Wu Fengguang
@ 2011-04-21  7:09                 ` Dave Chinner
  2011-04-21  7:09                 ` Dave Chinner
  1 sibling, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-21  7:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> I collected the writeback_single_inode() traces (patch attached for
> your reference) each for several test runs, and found much more
> I_DIRTY_PAGES after the patchset. Dave, do you know why so many
> I_DIRTY_PAGES (or radix tags) remain after the XFS ->writepages() call,
> even for small files?
> 
> wfg /tmp% g -c I_DIRTY_PAGES trace-*
> trace-moving-expire-1:28213
> trace-no-moving-expire:6684
> 
> wfg /tmp% g -c I_DIRTY_DATASYNC trace-*
> trace-moving-expire-1:179
> trace-no-moving-expire:193
> 
> wfg /tmp% g -c I_DIRTY_SYNC trace-* 
> trace-moving-expire-1:29394
> trace-no-moving-expire:31593
> 
> wfg /tmp% wc -l trace-*
>    81108 trace-moving-expire-1
>    68562 trace-no-moving-expire

Likely just timing. When IO completes and updates the inode IO size,
XFS calls mark_inode_dirty() again to ensure that the metadata that
was changed gets written out at a later point in time.
Hence every single file that is created by the test will be marked
dirty again after the first write has returned and disappeared.

Why do you see different numbers? It's timing dependent, based on IO
completion rates - if you have a fast disk the IO completion can
occur before write_inode() is called, and so the inode can be written
and the dirty page state removed in the one writeback_single_inode()
call...

That's my initial guess without looking at it in any real detail,
anyway.
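
In other words (illustrative summary of the two orderings):

	fast disk:  write() -> I/O completion -> inode re-marked dirty
	            -> one writeback_single_inode() pass writes the
	               inode and clears the dirty state
	slow disk:  writeback_single_inode() -> ->writepages() returns
	            -> I/O completes afterwards -> inode re-marked
	               dirty -> extra I_DIRTY_* entries in the traces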

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  7:09                 ` Dave Chinner
@ 2011-04-21  7:14                   ` Christoph Hellwig
  -1 siblings, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-04-21  7:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Jan Kara, Andrew Morton, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 05:09:47PM +1000, Dave Chinner wrote:
> Likely just timing. When IO completes and updates the inode IO size,
> XFS calls mark_inode_dirty() again to ensure that the metadata that
> was changed gets written out at a later point in time.
> Hence every single file that is created by the test will be marked
> dirty again after the first write has returned and disappeared.
> 
> Why do you see different numbers? It's timing dependent, based on IO
> completion rates - if you have a fast disk the IO completion can
> occur before write_inode() is called, and so the inode can be written
> and the dirty page state removed in the one writeback_single_inode()
> call...
> 
> That's my initial guess without looking at it in any real detail,
> anyway.

We shouldn't have I_DIRTY_PAGES set for that case, as we only redirty
metadata.  But we're actually doing an xfs_mark_inode_dirty, which
dirties all of I_DIRTY, which includes I_DIRTY_PAGES.  I guess it
should change to

	__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_DATASYNC);
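
i.e., in the I/O-completion size update path, something like the
following (sketch only, untested; xfs_setfilesize() as found in
kernels of this era):

	STATIC void
	xfs_setfilesize(
		xfs_ioend_t	*ioend)
	{
		/* ... existing on-disk size update ... */

		/*
		 * Mark only the inode metadata dirty, so that
		 * I_DIRTY_PAGES is not set again behind the back of
		 * ->writepages().
		 */
		__mark_inode_dirty(ioend->io_inode,
				   I_DIRTY_SYNC | I_DIRTY_DATASYNC);
	}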


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 0/6] writeback: moving expire targets for background/kupdate works
  2011-04-21  6:07         ` Wu Fengguang
@ 2011-04-21  7:17           ` Christoph Hellwig
  -1 siblings, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-04-21  7:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Andrew Morton, Jan Kara, Mel Gorman,
	Dave Chinner, Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

Here's the inode_wb_list_lock splitup against current mainline:

---
From: Christoph Hellwig <hch@lst.de>
Subject: [PATCH] writeback: split inode_wb_list_lock

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads.  It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2011-04-21 08:31:44.512334499 +0200
+++ linux-2.6/fs/fs-writeback.c	2011-04-21 09:07:05.327511722 +0200
@@ -180,12 +180,13 @@ void bdi_start_background_writeback(stru
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
 	list_del_init(&inode->i_wb_list);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 }
 
-
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -195,11 +196,9 @@ void inode_wb_list_del(struct inode *ino
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode)
+static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -213,11 +212,9 @@ static void redirty_tail(struct inode *i
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
-static void requeue_io(struct inode *inode)
+static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
@@ -225,7 +222,7 @@ static void inode_sync_complete(struct i
 {
 	/*
 	 * Prevent speculative execution through
-	 * spin_unlock(&inode_wb_list_lock);
+	 * spin_unlock(&wb->list_lock);
 	 */
 
 	smp_mb();
@@ -301,7 +298,7 @@ static void move_expired_inodes(struct l
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -316,7 +313,8 @@ static int write_inode(struct inode *ino
 /*
  * Wait for writeback on an inode to complete.
  */
-static void inode_wait_for_writeback(struct inode *inode)
+static void inode_wait_for_writeback(struct inode *inode,
+		struct bdi_writeback *wb)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -324,15 +322,15 @@ static void inode_wait_for_writeback(str
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_wb_list_lock and
+ * Write out an inode's dirty pages.  Called under wb->list_lock and
  * inode->i_lock.  Either the caller has an active reference on the inode or
  * the inode has I_WILL_FREE set.
  *
@@ -343,13 +341,14 @@ static void inode_wait_for_writeback(str
  * livelocks, etc.
  */
 static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+		struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
 
-	assert_spin_locked(&inode_wb_list_lock);
+	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
 
 	if (!atomic_read(&inode->i_count))
@@ -367,14 +366,14 @@ writeback_single_inode(struct inode *ino
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			return 0;
 		}
 
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
-		inode_wait_for_writeback(inode);
+		inode_wait_for_writeback(inode, wb);
 	}
 
 	BUG_ON(inode->i_state & I_SYNC);
@@ -383,7 +382,7 @@ writeback_single_inode(struct inode *ino
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -414,7 +413,7 @@ writeback_single_inode(struct inode *ino
 			ret = err;
 	}
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -428,7 +427,7 @@ writeback_single_inode(struct inode *ino
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(inode);
+				requeue_io(inode, wb);
 			} else {
 				/*
 				 * Writeback blocked by something other than
@@ -437,7 +436,7 @@ writeback_single_inode(struct inode *ino
 				 * retrying writeback of the dirty page/inode
 				 * that cannot be performed immediately.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 			}
 		} else if (inode->i_state & I_DIRTY) {
 			/*
@@ -446,7 +445,7 @@ writeback_single_inode(struct inode *ino
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -510,7 +509,7 @@ static int writeback_sb_inodes(struct su
 				 * superblock, move all inodes not belonging
 				 * to it back onto the dirty list.
 				 */
-				redirty_tail(inode);
+				redirty_tail(inode, wb);
 				continue;
 			}
 
@@ -530,7 +529,7 @@ static int writeback_sb_inodes(struct su
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 
@@ -546,19 +545,19 @@ static int writeback_sb_inodes(struct su
 		__iget(inode);
 
 		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wbc);
+		writeback_single_inode(inode, wb, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
-			redirty_tail(inode);
+			redirty_tail(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -577,7 +576,7 @@ void writeback_inodes_wb(struct bdi_writ
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
@@ -586,7 +585,7 @@ void writeback_inodes_wb(struct bdi_writ
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
-			requeue_io(inode);
+			requeue_io(inode, wb);
 			continue;
 		}
 		ret = writeback_sb_inodes(sb, wb, wbc, false);
@@ -595,7 +594,7 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -604,11 +603,11 @@ static void __writeback_inodes_sb(struct
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 }
 
 /*
@@ -747,15 +746,15 @@ static long wb_writeback(struct bdi_writ
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&inode_wb_list_lock);
+		spin_lock(&wb->list_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = wb_inode(wb->b_more_io.prev);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode);
+			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&wb->list_lock);
 	}
 
 	return wrote;
@@ -1092,10 +1091,10 @@ void __mark_inode_dirty(struct inode *in
 			}
 
 			spin_unlock(&inode->i_lock);
-			spin_lock(&inode_wb_list_lock);
+			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
-			spin_unlock(&inode_wb_list_lock);
+			spin_unlock(&bdi->wb.list_lock);
 
 			if (wakeup_bdi)
 				bdi_wakeup_thread_delayed(bdi);
@@ -1296,6 +1295,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
@@ -1308,11 +1308,11 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, &wbc);
+	ret = writeback_single_inode(inode, wb, &wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -1332,13 +1332,14 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wbc);
+	ret = writeback_single_inode(inode, wb, wbc);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2011-04-21 08:31:40.172358011 +0200
+++ linux-2.6/fs/inode.c	2011-04-21 09:07:05.327511722 +0200
@@ -37,7 +37,7 @@
  *   inode_lru, inode->i_lru
  * inode_sb_list_lock protects:
  *   sb->s_inodes, inode->i_sb_list
- * inode_wb_list_lock protects:
+ * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
@@ -48,7 +48,7 @@
  *   inode->i_lock
  *     inode_lru_lock
  *
- * inode_wb_list_lock
+ * bdi->wb.list_lock
  *   inode->i_lock
  *
  * inode_hash_lock
@@ -111,7 +111,6 @@ static LIST_HEAD(inode_lru);
 static DEFINE_SPINLOCK(inode_lru_lock);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);
 
 /*
  * iprune_sem provides exclusion between the icache shrinking and the
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2011-04-21 08:31:42.185680435 +0200
+++ linux-2.6/include/linux/writeback.h	2011-04-21 09:07:05.327511722 +0200
@@ -9,8 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_wb_list_lock;
-
 /*
  * fs/fs-writeback.c
  */
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c	2011-04-21 08:31:44.532334389 +0200
+++ linux-2.6/mm/backing-dev.c	2011-04-21 09:07:05.327511722 +0200
@@ -45,6 +45,17 @@ static struct timer_list sync_supers_tim
 static int bdi_sync_supers(void *);
 static void sync_supers_timer_fn(unsigned long);
 
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
+{
+	if (wb1 < wb2) {
+		spin_lock(&wb1->list_lock);
+		spin_lock_nested(&wb2->list_lock, 1);
+	} else {
+		spin_lock(&wb2->list_lock);
+		spin_lock_nested(&wb1->list_lock, 1);
+	}
+}
+
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
@@ -67,14 +78,14 @@ static int bdi_debug_stats_show(struct s
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_wb_list_lock);
+	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_wb_list)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -628,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -676,11 +688,12 @@ void bdi_destroy(struct backing_dev_info
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 
-		spin_lock(&inode_wb_list_lock);
+		bdi_lock_two(&bdi->wb, dst);
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
-		spin_unlock(&inode_wb_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
+		spin_unlock(&dst->list_lock);
 	}
 
 	bdi_unregister(bdi);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2011-04-21 08:31:42.159013915 +0200
+++ linux-2.6/mm/filemap.c	2011-04-21 09:07:05.330845037 +0200
@@ -80,7 +80,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  inode_wb_list_lock
+ *  bdi->wb.list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -98,9 +98,9 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(page_remove_rmap->set_page_dirty)
+ *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
- *    inode_wb_list_lock	(zap_pte_range->set_page_dirty)
+ *    bdi.wb->list_lock	(zap_pte_range->set_page_dirty)
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2011-04-21 08:31:41.519017382 +0200
+++ linux-2.6/mm/rmap.c	2011-04-21 09:07:05.330845037 +0200
@@ -32,11 +32,11 @@
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
  *               inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *               inode_wb_list_lock (in set_page_dirty's __mark_inode_dirty)
+ *               bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                 sb_lock (within inode_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_wb_list_lock in __sync_single_inode)
+ *                           within bdi.wb->list_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2011-04-21 08:31:44.522334444 +0200
+++ linux-2.6/fs/block_dev.c	2011-04-21 09:07:05.330845037 +0200
@@ -55,13 +55,16 @@ EXPORT_SYMBOL(I_BDEV);
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
-	spin_lock(&inode_wb_list_lock);
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
+	bdi_lock_two(&old->wb, &dst->wb);
 	spin_lock(&inode->i_lock);
 	inode->i_data.backing_dev_info = dst;
 	if (inode->i_state & I_DIRTY)
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_wb_list_lock);
+	spin_unlock(&old->wb.list_lock);
+	spin_unlock(&dst->wb.list_lock);
 }
 
 static sector_t max_block(struct block_device *bdev)
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2011-04-21 08:31:42.202347013 +0200
+++ linux-2.6/include/linux/backing-dev.h	2011-04-21 09:07:05.330845037 +0200
@@ -57,6 +57,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t list_lock;		/* protects the b_* lists. */
 };
 
 struct backing_dev_info {
@@ -106,6 +107,7 @@ int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
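
A side note on the bdi_lock_two() helper above (explanatory, not part
of the patch): taking the two list_locks in address order imposes a
single global ordering, so two threads that each need both locks can
never deadlock ABBA-style, and spin_lock_nested(..., 1) just tells
lockdep that the second acquisition of the same lock class is
intentional.  Callers then look like (illustrative, mirroring
bdev_inode_switch_bdi above):

	bdi_lock_two(&old->wb, &dst->wb);
	/* ... splice/move inodes between the two bdi lists ... */
	spin_unlock(&old->wb.list_lock);
	spin_unlock(&dst->wb.list_lock);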

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  7:14                   ` Christoph Hellwig
@ 2011-04-21  7:52                     ` Dave Chinner
  -1 siblings, 0 replies; 135+ messages in thread
From: Dave Chinner @ 2011-04-21  7:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, Jan Kara, Andrew Morton, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 03:14:26AM -0400, Christoph Hellwig wrote:
> On Thu, Apr 21, 2011 at 05:09:47PM +1000, Dave Chinner wrote:
> > Likely just timing. When IO completes and updates the inode IO size,
> > XFS calls mark_inode_dirty() again to ensure that the metadata that
> > was changed gets written out at a later point in time.
> > Hence every single file that is created by the test will be marked
> > dirty again after the first write has returned and disappeared.
> > 
> > Why do you see different numbers? It's timing dependent, based on IO
> > completion rates - if you have a fast disk the IO completion can
> > occur before write_inode() is called, and so the inode can be written
> > and the dirty page state removed in the one writeback_single_inode()
> > call...
> > 
> > That's my initial guess without looking at it in any real detail,
> > anyway.
> 
> We shouldn't have I_DIRTY_PAGES set for that case, as we only redirty
> metadata.  But we're actually doing an xfs_mark_inode_dirty, which
> dirties all of I_DIRTY, which includes I_DIRTY_PAGES.  I guess it
> should change to
> 
> 	__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_DATASYNC);

Probably should. Using xfs_mark_inode_dirty_sync() might be the best
thing to do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  7:52                     ` Dave Chinner
@ 2011-04-21  8:00                       ` Christoph Hellwig
  -1 siblings, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-04-21  8:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Wu Fengguang, Jan Kara, Andrew Morton,
	Mel Gorman, Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 05:52:58PM +1000, Dave Chinner wrote:
> > We shouldn't have I_DIRTY_PAGES set for that case, as we only redirty
> > metadata.  But we're actually doing an xfs_mark_inode_dirty, which
> > dirties all of I_DIRTY, which includes I_DIRTY_PAGES.  I guess it
> > should change to
> > 
> > 	__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> 
> Probably should. Using xfs_mark_inode_dirty_sync() might be the best
> thing to do.

That's not correct either - we need to set I_DIRTY_DATASYNC so that it
gets caught by fsync and not just fdatasync.

But thinking about it I'm actually not sure we need it at all.  We already
wait for the i_iocount to go to zero both in fsync and ->sync_fs, which will
catch pending I/O completions even without any VFS dirty state.  So just
marking the inode dirty (as I_DIRTY_SYNC | I_DIRTY_DATASYNC) on I/O
completion should be enough these days.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 0/6] writeback: moving expire targets for background/kupdate works
  2011-04-21  7:17           ` Christoph Hellwig
@ 2011-04-21 10:15             ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-21 10:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu, Apr 21, 2011 at 03:17:57PM +0800, Christoph Hellwig wrote:
> Here's the inode_wb_list_lock splitup against current mainline:

So quick! I'll carry it and collect some numbers in my tests, btw.

Thanks,
Fengguang

> ---
> From: Christoph Hellwig <hch@lst.de>
> Subject: [PATCH] writeback: split inode_wb_list_lock
> 
> Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
> as it's currently the most contended lock in the system for metadata
> heavy workloads.  It won't help for single-filesystem workloads, for
> which we'll need the I/O-less balance_dirty_pages, but at least we
> can dedicate a cpu to spinning on each bdi now for larger systems.
> 
> Based on earlier patches from Nick Piggin and Dave Chinner.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c    2011-04-21 08:31:44.512334499 +0200
> +++ linux-2.6/fs/fs-writeback.c 2011-04-21 09:07:05.327511722 +0200
> @@ -180,12 +180,13 @@ void bdi_start_background_writeback(stru
>   */
>  void inode_wb_list_del(struct inode *inode)
>  {
> -       spin_lock(&inode_wb_list_lock);
> +       struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> +       spin_lock(&bdi->wb.list_lock);
>         list_del_init(&inode->i_wb_list);
> -       spin_unlock(&inode_wb_list_lock);
> +       spin_unlock(&bdi->wb.list_lock);
>  }
> 
> -
>  /*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
> @@ -195,11 +196,9 @@ void inode_wb_list_del(struct inode *ino
>   * the case then the inode must have been redirtied while it was being written
>   * out and we don't reset its dirtied_when.
>   */
> -static void redirty_tail(struct inode *inode)
> +static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
>  {
> -       struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
> -
> -       assert_spin_locked(&inode_wb_list_lock);
> +       assert_spin_locked(&wb->list_lock);
>         if (!list_empty(&wb->b_dirty)) {
>                 struct inode *tail;
> 
> @@ -213,11 +212,9 @@ static void redirty_tail(struct inode *i
>  /*
>   * requeue inode for re-scanning after bdi->b_io list is exhausted.
>   */
> -static void requeue_io(struct inode *inode)
> +static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
>  {
> -       struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
> -
> -       assert_spin_locked(&inode_wb_list_lock);
> +       assert_spin_locked(&wb->list_lock);
>         list_move(&inode->i_wb_list, &wb->b_more_io);
>  }
> 
> @@ -225,7 +222,7 @@ static void inode_sync_complete(struct i
>  {
>         /*
>          * Prevent speculative execution through
> -        * spin_unlock(&inode_wb_list_lock);
> +        * spin_unlock(&wb->list_lock);
>          */
> 
>         smp_mb();
> @@ -301,7 +298,7 @@ static void move_expired_inodes(struct l
>   */
>  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
>  {
> -       assert_spin_locked(&inode_wb_list_lock);
> +       assert_spin_locked(&wb->list_lock);
>         list_splice_init(&wb->b_more_io, &wb->b_io);
>         move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
>  }
> @@ -316,7 +313,8 @@ static int write_inode(struct inode *ino
>  /*
>   * Wait for writeback on an inode to complete.
>   */
> -static void inode_wait_for_writeback(struct inode *inode)
> +static void inode_wait_for_writeback(struct inode *inode,
> +               struct bdi_writeback *wb)
>  {
>         DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
>         wait_queue_head_t *wqh;
> @@ -324,15 +322,15 @@ static void inode_wait_for_writeback(str
>         wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
>         while (inode->i_state & I_SYNC) {
>                 spin_unlock(&inode->i_lock);
> -               spin_unlock(&inode_wb_list_lock);
> +               spin_unlock(&wb->list_lock);
>                 __wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
> -               spin_lock(&inode_wb_list_lock);
> +               spin_lock(&wb->list_lock);
>                 spin_lock(&inode->i_lock);
>         }
>  }
> 
>  /*
> - * Write out an inode's dirty pages.  Called under inode_wb_list_lock and
> + * Write out an inode's dirty pages.  Called under wb->list_lock and
>   * inode->i_lock.  Either the caller has an active reference on the inode or
>   * the inode has I_WILL_FREE set.
>   *
> @@ -343,13 +341,14 @@ static void inode_wait_for_writeback(str
>   * livelocks, etc.
>   */
>  static int
> -writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> +writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> +               struct writeback_control *wbc)
>  {
>         struct address_space *mapping = inode->i_mapping;
>         unsigned dirty;
>         int ret;
> 
> -       assert_spin_locked(&inode_wb_list_lock);
> +       assert_spin_locked(&wb->list_lock);
>         assert_spin_locked(&inode->i_lock);
> 
>         if (!atomic_read(&inode->i_count))
> @@ -367,14 +366,14 @@ writeback_single_inode(struct inode *ino
>                  * completed a full scan of b_io.
>                  */
>                 if (wbc->sync_mode != WB_SYNC_ALL) {
> -                       requeue_io(inode);
> +                       requeue_io(inode, wb);
>                         return 0;
>                 }
> 
>                 /*
>                  * It's a data-integrity sync.  We must wait.
>                  */
> -               inode_wait_for_writeback(inode);
> +               inode_wait_for_writeback(inode, wb);
>         }
> 
>         BUG_ON(inode->i_state & I_SYNC);
> @@ -383,7 +382,7 @@ writeback_single_inode(struct inode *ino
>         inode->i_state |= I_SYNC;
>         inode->i_state &= ~I_DIRTY_PAGES;
>         spin_unlock(&inode->i_lock);
> -       spin_unlock(&inode_wb_list_lock);
> +       spin_unlock(&wb->list_lock);
> 
>         ret = do_writepages(mapping, wbc);
> 
> @@ -414,7 +413,7 @@ writeback_single_inode(struct inode *ino
>                         ret = err;
>         }
> 
> -       spin_lock(&inode_wb_list_lock);
> +       spin_lock(&wb->list_lock);
>         spin_lock(&inode->i_lock);
>         inode->i_state &= ~I_SYNC;
>         if (!(inode->i_state & I_FREEING)) {
> @@ -428,7 +427,7 @@ writeback_single_inode(struct inode *ino
>                                 /*
>                                  * slice used up: queue for next turn
>                                  */
> -                               requeue_io(inode);
> +                               requeue_io(inode, wb);
>                         } else {
>                                 /*
>                                  * Writeback blocked by something other than
> @@ -437,7 +436,7 @@ writeback_single_inode(struct inode *ino
>                                  * retrying writeback of the dirty page/inode
>                                  * that cannot be performed immediately.
>                                  */
> -                               redirty_tail(inode);
> +                               redirty_tail(inode, wb);
>                         }
>                 } else if (inode->i_state & I_DIRTY) {
>                         /*
> @@ -446,7 +445,7 @@ writeback_single_inode(struct inode *ino
>                          * submission or metadata updates after data IO
>                          * completion.
>                          */
> -                       redirty_tail(inode);
> +                       redirty_tail(inode, wb);
>                 } else {
>                         /*
>                          * The inode is clean.  At this point we either have
> @@ -510,7 +509,7 @@ static int writeback_sb_inodes(struct su
>                                  * superblock, move all inodes not belonging
>                                  * to it back onto the dirty list.
>                                  */
> -                               redirty_tail(inode);
> +                               redirty_tail(inode, wb);
>                                 continue;
>                         }
> 
> @@ -530,7 +529,7 @@ static int writeback_sb_inodes(struct su
>                 spin_lock(&inode->i_lock);
>                 if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>                         spin_unlock(&inode->i_lock);
> -                       requeue_io(inode);
> +                       requeue_io(inode, wb);
>                         continue;
>                 }
> 
> @@ -546,19 +545,19 @@ static int writeback_sb_inodes(struct su
>                 __iget(inode);
> 
>                 pages_skipped = wbc->pages_skipped;
> -               writeback_single_inode(inode, wbc);
> +               writeback_single_inode(inode, wb, wbc);
>                 if (wbc->pages_skipped != pages_skipped) {
>                         /*
>                          * writeback is not making progress due to locked
>                          * buffers.  Skip this inode for now.
>                          */
> -                       redirty_tail(inode);
> +                       redirty_tail(inode, wb);
>                 }
>                 spin_unlock(&inode->i_lock);
> -               spin_unlock(&inode_wb_list_lock);
> +               spin_unlock(&wb->list_lock);
>                 iput(inode);
>                 cond_resched();
> -               spin_lock(&inode_wb_list_lock);
> +               spin_lock(&wb->list_lock);
>                 if (wbc->nr_to_write <= 0) {
>                         wbc->more_io = 1;
>                         return 1;
> @@ -577,7 +576,7 @@ void writeback_inodes_wb(struct bdi_writ
> 
>         if (!wbc->wb_start)
>                 wbc->wb_start = jiffies; /* livelock avoidance */
> -       spin_lock(&inode_wb_list_lock);
> +       spin_lock(&wb->list_lock);
>         if (!wbc->for_kupdate || list_empty(&wb->b_io))
>                 queue_io(wb, wbc->older_than_this);
> 
> @@ -586,7 +585,7 @@ void writeback_inodes_wb(struct bdi_writ
>                 struct super_block *sb = inode->i_sb;
> 
>                 if (!pin_sb_for_writeback(sb)) {
> -                       requeue_io(inode);
> +                       requeue_io(inode, wb);
>                         continue;
>                 }
>                 ret = writeback_sb_inodes(sb, wb, wbc, false);
> @@ -595,7 +594,7 @@ void writeback_inodes_wb(struct bdi_writ
>                 if (ret)
>                         break;
>         }
> -       spin_unlock(&inode_wb_list_lock);
> +       spin_unlock(&wb->list_lock);
>         /* Leave any unwritten inodes on b_io */
>  }
> 
> @@ -604,11 +603,11 @@ static void __writeback_inodes_sb(struct
>  {
>         WARN_ON(!rwsem_is_locked(&sb->s_umount));
> 
> -       spin_lock(&inode_wb_list_lock);
> +       spin_lock(&wb->list_lock);
>         if (!wbc->for_kupdate || list_empty(&wb->b_io))
>                 queue_io(wb, wbc->older_than_this);
>         writeback_sb_inodes(sb, wb, wbc, true);
> -       spin_unlock(&inode_wb_list_lock);
> +       spin_unlock(&wb->list_lock);
>  }
> 
>  /*
> @@ -747,15 +746,15 @@ static long wb_writeback(struct bdi_writ
>                  * become available for writeback. Otherwise
>                  * we'll just busyloop.
>                  */
> -               spin_lock(&inode_wb_list_lock);
> +               spin_lock(&wb->list_lock);
>                 if (!list_empty(&wb->b_more_io))  {
>                         inode = wb_inode(wb->b_more_io.prev);
>                         trace_wbc_writeback_wait(&wbc, wb->bdi);
>                         spin_lock(&inode->i_lock);
> -                       inode_wait_for_writeback(inode);
> +                       inode_wait_for_writeback(inode, wb);
>                         spin_unlock(&inode->i_lock);
>                 }
> -               spin_unlock(&inode_wb_list_lock);
> +               spin_unlock(&wb->list_lock);
>         }
> 
>         return wrote;
> @@ -1092,10 +1091,10 @@ void __mark_inode_dirty(struct inode *in
>                         }
> 
>                         spin_unlock(&inode->i_lock);
> -                       spin_lock(&inode_wb_list_lock);
> +                       spin_lock(&bdi->wb.list_lock);
>                         inode->dirtied_when = jiffies;
>                         list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> -                       spin_unlock(&inode_wb_list_lock);
> +                       spin_unlock(&bdi->wb.list_lock);
> 
>                         if (wakeup_bdi)
>                                 bdi_wakeup_thread_delayed(bdi);
> @@ -1296,6 +1295,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
>   */
>  int write_inode_now(struct inode *inode, int sync)
>  {
> +       struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
>         int ret;
>         struct writeback_control wbc = {
>                 .nr_to_write = LONG_MAX,
> @@ -1308,11 +1308,11 @@ int write_inode_now(struct inode *inode,
>                 wbc.nr_to_write = 0;
> 
>         might_sleep();
> -       spin_lock(&inode_wb_list_lock);
> +       spin_lock(&wb->list_lock);
>         spin_lock(&inode->i_lock);
> -       ret = writeback_single_inode(inode, &wbc);
> +       ret = writeback_single_inode(inode, wb, &wbc);
>         spin_unlock(&inode->i_lock);
> -       spin_unlock(&inode_wb_list_lock);
> +       spin_unlock(&wb->list_lock);
>         if (sync)
>                 inode_sync_wait(inode);
>         return ret;
> @@ -1332,13 +1332,14 @@ EXPORT_SYMBOL(write_inode_now);
>   */
>  int sync_inode(struct inode *inode, struct writeback_control *wbc)
>  {
> +       struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
>         int ret;
> 
> -       spin_lock(&inode_wb_list_lock);
> +       spin_lock(&wb->list_lock);
>         spin_lock(&inode->i_lock);
> -       ret = writeback_single_inode(inode, wbc);
> +       ret = writeback_single_inode(inode, wb, wbc);
>         spin_unlock(&inode->i_lock);
> -       spin_unlock(&inode_wb_list_lock);
> +       spin_unlock(&wb->list_lock);
>         return ret;
>  }
>  EXPORT_SYMBOL(sync_inode);
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c   2011-04-21 08:31:40.172358011 +0200
> +++ linux-2.6/fs/inode.c        2011-04-21 09:07:05.327511722 +0200
> @@ -37,7 +37,7 @@
>   *   inode_lru, inode->i_lru
>   * inode_sb_list_lock protects:
>   *   sb->s_inodes, inode->i_sb_list
> - * inode_wb_list_lock protects:
> + * bdi->wb.list_lock protects:
>   *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
>   * inode_hash_lock protects:
>   *   inode_hashtable, inode->i_hash
> @@ -48,7 +48,7 @@
>   *   inode->i_lock
>   *     inode_lru_lock
>   *
> - * inode_wb_list_lock
> + * bdi->wb.list_lock
>   *   inode->i_lock
>   *
>   * inode_hash_lock
> @@ -111,7 +111,6 @@ static LIST_HEAD(inode_lru);
>  static DEFINE_SPINLOCK(inode_lru_lock);
> 
>  __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
> -__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);
> 
>  /*
>   * iprune_sem provides exclusion between the icache shrinking and the
> Index: linux-2.6/include/linux/writeback.h
> ===================================================================
> --- linux-2.6.orig/include/linux/writeback.h    2011-04-21 08:31:42.185680435 +0200
> +++ linux-2.6/include/linux/writeback.h 2011-04-21 09:07:05.327511722 +0200
> @@ -9,8 +9,6 @@
> 
>  struct backing_dev_info;
> 
> -extern spinlock_t inode_wb_list_lock;
> -
>  /*
>   * fs/fs-writeback.c
>   */
> Index: linux-2.6/mm/backing-dev.c
> ===================================================================
> --- linux-2.6.orig/mm/backing-dev.c     2011-04-21 08:31:44.532334389 +0200
> +++ linux-2.6/mm/backing-dev.c  2011-04-21 09:07:05.327511722 +0200
> @@ -45,6 +45,17 @@ static struct timer_list sync_supers_tim
>  static int bdi_sync_supers(void *);
>  static void sync_supers_timer_fn(unsigned long);
> 
> +void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2)
> +{
> +       if (wb1 < wb2) {
> +               spin_lock(&wb1->list_lock);
> +               spin_lock_nested(&wb2->list_lock, 1);
> +       } else {
> +               spin_lock(&wb2->list_lock);
> +               spin_lock_nested(&wb1->list_lock, 1);
> +       }
> +}
> +
>  #ifdef CONFIG_DEBUG_FS
>  #include <linux/debugfs.h>
>  #include <linux/seq_file.h>
> @@ -67,14 +78,14 @@ static int bdi_debug_stats_show(struct s
>         struct inode *inode;
> 
>         nr_wb = nr_dirty = nr_io = nr_more_io = 0;
> -       spin_lock(&inode_wb_list_lock);
> +       spin_lock(&wb->list_lock);
>         list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
>                 nr_dirty++;
>         list_for_each_entry(inode, &wb->b_io, i_wb_list)
>                 nr_io++;
>         list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
>                 nr_more_io++;
> -       spin_unlock(&inode_wb_list_lock);
> +       spin_unlock(&wb->list_lock);
> 
>         global_dirty_limits(&background_thresh, &dirty_thresh);
>         bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -628,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
>         INIT_LIST_HEAD(&wb->b_dirty);
>         INIT_LIST_HEAD(&wb->b_io);
>         INIT_LIST_HEAD(&wb->b_more_io);
> +       spin_lock_init(&wb->list_lock);
>         setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
>  }
> 
> @@ -676,11 +688,12 @@ void bdi_destroy(struct backing_dev_info
>         if (bdi_has_dirty_io(bdi)) {
>                 struct bdi_writeback *dst = &default_backing_dev_info.wb;
> 
> -               spin_lock(&inode_wb_list_lock);
> +               bdi_lock_two(&bdi->wb, dst);
>                 list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
>                 list_splice(&bdi->wb.b_io, &dst->b_io);
>                 list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
> -               spin_unlock(&inode_wb_list_lock);
> +               spin_unlock(&bdi->wb.list_lock);
> +               spin_unlock(&dst->list_lock);
>         }
> 
>         bdi_unregister(bdi);
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c 2011-04-21 08:31:42.159013915 +0200
> +++ linux-2.6/mm/filemap.c      2011-04-21 09:07:05.330845037 +0200
> @@ -80,7 +80,7 @@
>   *  ->i_mutex
>   *    ->i_alloc_sem             (various)
>   *
> - *  inode_wb_list_lock
> + *  bdi->wb.list_lock
>   *    sb_lock                  (fs/fs-writeback.c)
>   *    ->mapping->tree_lock     (__sync_single_inode)
>   *
> @@ -98,9 +98,9 @@
>   *    ->zone.lru_lock          (check_pte_range->isolate_lru_page)
>   *    ->private_lock           (page_remove_rmap->set_page_dirty)
>   *    ->tree_lock              (page_remove_rmap->set_page_dirty)
> - *    inode_wb_list_lock       (page_remove_rmap->set_page_dirty)
> + *    bdi.wb->list_lock                (page_remove_rmap->set_page_dirty)
>   *    ->inode->i_lock          (page_remove_rmap->set_page_dirty)
> - *    inode_wb_list_lock       (zap_pte_range->set_page_dirty)
> + *    bdi.wb->list_lock        (zap_pte_range->set_page_dirty)
>   *    ->inode->i_lock          (zap_pte_range->set_page_dirty)
>   *    ->private_lock           (zap_pte_range->__set_page_dirty_buffers)
>   *
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c    2011-04-21 08:31:41.519017382 +0200
> +++ linux-2.6/mm/rmap.c 2011-04-21 09:07:05.330845037 +0200
> @@ -32,11 +32,11 @@
>   *               mmlist_lock (in mmput, drain_mmlist and others)
>   *               mapping->private_lock (in __set_page_dirty_buffers)
>   *               inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> - *               inode_wb_list_lock (in set_page_dirty's __mark_inode_dirty)
> + *               bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
>   *                 sb_lock (within inode_lock in fs/fs-writeback.c)
>   *                 mapping->tree_lock (widely used, in set_page_dirty,
>   *                           in arch-dependent flush_dcache_mmap_lock,
> - *                           within inode_wb_list_lock in __sync_single_inode)
> + *                           within bdi.wb->list_lock in __sync_single_inode)
>   *
>   * (code doesn't rely on that order so it could be switched around)
>   * ->tasklist_lock
> Index: linux-2.6/fs/block_dev.c
> ===================================================================
> --- linux-2.6.orig/fs/block_dev.c       2011-04-21 08:31:44.522334444 +0200
> +++ linux-2.6/fs/block_dev.c    2011-04-21 09:07:05.330845037 +0200
> @@ -55,13 +55,16 @@ EXPORT_SYMBOL(I_BDEV);
>  static void bdev_inode_switch_bdi(struct inode *inode,
>                         struct backing_dev_info *dst)
>  {
> -       spin_lock(&inode_wb_list_lock);
> +       struct backing_dev_info *old = inode->i_data.backing_dev_info;
> +
> +       bdi_lock_two(&old->wb, &dst->wb);
>         spin_lock(&inode->i_lock);
>         inode->i_data.backing_dev_info = dst;
>         if (inode->i_state & I_DIRTY)
>                 list_move(&inode->i_wb_list, &dst->wb.b_dirty);
>         spin_unlock(&inode->i_lock);
> -       spin_unlock(&inode_wb_list_lock);
> +       spin_unlock(&old->wb.list_lock);
> +       spin_unlock(&dst->wb.list_lock);
>  }
> 
>  static sector_t max_block(struct block_device *bdev)
> Index: linux-2.6/include/linux/backing-dev.h
> ===================================================================
> --- linux-2.6.orig/include/linux/backing-dev.h  2011-04-21 08:31:42.202347013 +0200
> +++ linux-2.6/include/linux/backing-dev.h       2011-04-21 09:07:05.330845037 +0200
> @@ -57,6 +57,7 @@ struct bdi_writeback {
>         struct list_head b_dirty;       /* dirty inodes */
>         struct list_head b_io;          /* parked for writeback */
>         struct list_head b_more_io;     /* parked for more writeback */
> +       spinlock_t list_lock;           /* protects the b_* lists. */
>  };
> 
>  struct backing_dev_info {
> @@ -106,6 +107,7 @@ int bdi_writeback_thread(void *data);
>  int bdi_has_dirty_io(struct backing_dev_info *bdi);
>  void bdi_arm_supers_timer(void);
>  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> +void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
> 
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21  4:10                       ` Wu Fengguang
@ 2011-04-21 16:04                         ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-21 16:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Mel Gorman, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu 21-04-11 12:10:11, Wu Fengguang wrote:
> > > Still, given wb_writeback() is the only caller of both
> > > __writeback_inodes_sb and writeback_inodes_wb(), I'm wondering if
> > > moving the queue_io calls up into wb_writeback() would clean up this
> > > logic somewhat. I think Jan mentioned doing something like this as
> > > well elsewhere in the thread...
> > 
> > Unfortunately they call queue_io() inside the lock..
> 
> OK, let's try moving up the lock too. Do you like this change? :)
> 
> Thanks,
> Fengguang
> ---
>  fs/fs-writeback.c |   22 ++++++----------------
>  mm/backing-dev.c  |    4 ++++
>  2 files changed, 10 insertions(+), 16 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-04-21 12:04:02.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-04-21 12:05:54.000000000 +0800
> @@ -591,7 +591,6 @@ void writeback_inodes_wb(struct bdi_writ
>  
>  	if (!wbc->wb_start)
>  		wbc->wb_start = jiffies; /* livelock avoidance */
> -	spin_lock(&inode_wb_list_lock);
>  
>  	if (list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
> @@ -610,22 +609,9 @@ void writeback_inodes_wb(struct bdi_writ
>  		if (ret)
>  			break;
>  	}
> -	spin_unlock(&inode_wb_list_lock);
>  	/* Leave any unwritten inodes on b_io */
>  }
>  
> -static void __writeback_inodes_sb(struct super_block *sb,
> -		struct bdi_writeback *wb, struct writeback_control *wbc)
> -{
> -	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> -
> -	spin_lock(&inode_wb_list_lock);
> -	if (list_empty(&wb->b_io))
> -		queue_io(wb, wbc);
> -	writeback_sb_inodes(sb, wb, wbc, true);
> -	spin_unlock(&inode_wb_list_lock);
> -}
> -
>  static inline bool over_bground_thresh(void)
>  {
>  	unsigned long background_thresh, dirty_thresh;
> @@ -652,7 +638,7 @@ static unsigned long writeback_chunk_siz
>  	 * The intended call sequence for WB_SYNC_ALL writeback is:
>  	 *
>  	 *      wb_writeback()
> -	 *          __writeback_inodes_sb()     <== called only once
> +	 *          writeback_sb_inodes()       <== called only once
>  	 *              write_cache_pages()     <== called once for each inode
>  	 *                  (quickly) tag currently dirty pages
>  	 *                  (maybe slowly) sync all tagged pages
> @@ -742,10 +728,14 @@ static long wb_writeback(struct bdi_writ
>  
>  retry:
>  		trace_wbc_writeback_start(&wbc, wb->bdi);
> +		spin_lock(&inode_wb_list_lock);
> +		if (list_empty(&wb->b_io))
> +			queue_io(wb, wbc);
>  		if (work->sb)
> -			__writeback_inodes_sb(work->sb, wb, &wbc);
> +			writeback_sb_inodes(work->sb, wb, &wbc, true);
>  		else
>  			writeback_inodes_wb(wb, &wbc);
> +		spin_unlock(&inode_wb_list_lock);
>  		trace_wbc_writeback_written(&wbc, wb->bdi);
>  
>  		bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
> --- linux-next.orig/mm/backing-dev.c	2011-04-21 12:06:02.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-04-21 12:06:31.000000000 +0800
> @@ -268,7 +268,11 @@ static void bdi_flush_io(struct backing_
>  		.nr_to_write		= 1024,
>  	};
>  
> +	spin_lock(&inode_wb_list_lock);
> +	if (list_empty(&wb->b_io))
> +		queue_io(wb, wbc);
>  	writeback_inodes_wb(&bdi->wb, &wbc);
> +	spin_unlock(&inode_wb_list_lock);
>  }
  Three notes here:
1) You are missing the writeback_inodes_wb() call in
balance_dirty_pages() (the patch should really work for vanilla kernels).
2) The intention of both bdi_flush_io() and balance_dirty_pages() is to
write .nr_to_write pages. So they should either do queue_io()
unconditionally (I kind of like that for simplicity) or requeue once if
they have not written enough (see the sketch below); otherwise it could
happen that they are called just at the moment when b_io contains a
single inode with a few dirty pages, and they end up doing almost
nothing.
3) I guess your patch does not compile because queue_io() is static ;).
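
A hedged sketch of option 2) for bdi_flush_io(), following the
function as quoted above (untested, illustration only;
balance_dirty_pages() would need the same treatment):

	static void bdi_flush_io(struct backing_dev_info *bdi)
	{
		struct writeback_control wbc = {
			.sync_mode		= WB_SYNC_NONE,
			.older_than_this	= NULL,
			.range_cyclic		= 1,
			.nr_to_write		= 1024,
		};
		int requeued = 0;

	again:
		/* queue_io() would have to be made non-static, per 3) */
		spin_lock(&inode_wb_list_lock);
		if (list_empty(&bdi->wb.b_io))
			queue_io(&bdi->wb, &wbc);
		writeback_inodes_wb(&bdi->wb, &wbc);
		spin_unlock(&inode_wb_list_lock);

		/*
		 * If b_io held only an inode or two with a few dirty
		 * pages, almost nothing was written; refill b_io once
		 * and retry to get closer to nr_to_write.
		 */
		if (wbc.nr_to_write > 0 && !requeued) {
			requeued = 1;
			goto again;
		}
	}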

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  6:05                   ` Wu Fengguang
@ 2011-04-21 16:41                     ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-21 16:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Jan Kara, Andrew Morton, Mel Gorman,
	Dave Chinner, Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Thu 21-04-11 14:05:56, Wu Fengguang wrote:
> On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> > On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > > I collected the writeback_single_inode() traces (patch attached for
> > > your reference) each for several test runs, and find much more
> > > I_DIRTY_PAGES after patchset. Dave, do you know why there are so many
> > > I_DIRTY_PAGES (or radix tag) remained after the XFS ->writepages() call,
> > > even for small files?
> > 
> > What is your definition of a small file?  As soon as it has multiple
> > extents or holes there's absolutely no way to clean it with a single
> > writepage call.
> 
> It's writing a kernel source tree to XFS. You can find in the below
> trace that it often leaves more dirty pages behind (indicated by the
> I_DIRTY_PAGES flag) after writing as few as 1 page (indicated by the
> wrote=1 field).
  As Dave said, it's probably just a race since XFS redirties the inode on
IO completion. So I think the inodes are just small: they have only a few
dirty pages, so you don't have much to write, and they are written and
redirtied before you check the I_DIRTY flags. You could use the radix tree
dirty tag to verify whether there are really dirty pages or not...
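
[A sketch of such a check, assuming the pagecache tag API of that era;
the trace message itself is made up here for illustration:]

	/* consult the radix tree dirty tag rather than I_DIRTY_PAGES */
	if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY))
		trace_printk("inode %lu still has truly dirty pages\n",
			     inode->i_ino);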

  BTW a quick check of kernel tree shows the following distribution of
sizes (in KB):
  Count KB  Cumulative Percent
    257 0   0.9%
  13309 4   45%
   5553 8   63%
   2997 12  73%
   1879 16  80%
   1275 20  83%
    987 24  87%
    685 28  89%
    540 32  91%
    387 36  ...
    309 40
    264 44
    249 48
    170 52
    143 56
    144 60
    132 64
    100 68
    ...
Total 30155

And the distribution of your 'wrote=xxx' roughly corresponds to this...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-21 16:04                         ` Jan Kara
@ 2011-04-22  2:24                           ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-22  2:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Andrew Morton, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Fri, Apr 22, 2011 at 12:04:05AM +0800, Jan Kara wrote:
> On Thu 21-04-11 12:10:11, Wu Fengguang wrote:
> > > > Still, given wb_writeback() is the only caller of both
> > > > __writeback_inodes_sb and writeback_inodes_wb(), I'm wondering if
> > > > moving the queue_io calls up into wb_writeback() would clean up this
> > > > logic somewhat. I think Jan mentioned doing something like this as
> > > > well elsewhere in the thread...
> > > 
> > > Unfortunately they call queue_io() inside the lock..
> > 
> > OK, let's try moving up the lock too. Do you like this change? :)
> > 
> > Thanks,
> > Fengguang
> > ---
> >  fs/fs-writeback.c |   22 ++++++----------------
> >  mm/backing-dev.c  |    4 ++++
> >  2 files changed, 10 insertions(+), 16 deletions(-)
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2011-04-21 12:04:02.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2011-04-21 12:05:54.000000000 +0800
> > @@ -591,7 +591,6 @@ void writeback_inodes_wb(struct bdi_writ
> >  
> >  	if (!wbc->wb_start)
> >  		wbc->wb_start = jiffies; /* livelock avoidance */
> > -	spin_lock(&inode_wb_list_lock);
> >  
> >  	if (list_empty(&wb->b_io))
> >  		queue_io(wb, wbc);
> > @@ -610,22 +609,9 @@ void writeback_inodes_wb(struct bdi_writ
> >  		if (ret)
> >  			break;
> >  	}
> > -	spin_unlock(&inode_wb_list_lock);
> >  	/* Leave any unwritten inodes on b_io */
> >  }
> >  
> > -static void __writeback_inodes_sb(struct super_block *sb,
> > -		struct bdi_writeback *wb, struct writeback_control *wbc)
> > -{
> > -	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > -
> > -	spin_lock(&inode_wb_list_lock);
> > -	if (list_empty(&wb->b_io))
> > -		queue_io(wb, wbc);
> > -	writeback_sb_inodes(sb, wb, wbc, true);
> > -	spin_unlock(&inode_wb_list_lock);
> > -}
> > -
> >  static inline bool over_bground_thresh(void)
> >  {
> >  	unsigned long background_thresh, dirty_thresh;
> > @@ -652,7 +638,7 @@ static unsigned long writeback_chunk_siz
> >  	 * The intended call sequence for WB_SYNC_ALL writeback is:
> >  	 *
> >  	 *      wb_writeback()
> > -	 *          __writeback_inodes_sb()     <== called only once
> > +	 *          writeback_sb_inodes()       <== called only once
> >  	 *              write_cache_pages()     <== called once for each inode
> >  	 *                  (quickly) tag currently dirty pages
> >  	 *                  (maybe slowly) sync all tagged pages
> > @@ -742,10 +728,14 @@ static long wb_writeback(struct bdi_writ
> >  
> >  retry:
> >  		trace_wbc_writeback_start(&wbc, wb->bdi);
> > +		spin_lock(&inode_wb_list_lock);
> > +		if (list_empty(&wb->b_io))
> > +			queue_io(wb, wbc);
> >  		if (work->sb)
> > -			__writeback_inodes_sb(work->sb, wb, &wbc);
> > +			writeback_sb_inodes(work->sb, wb, &wbc, true);
> >  		else
> >  			writeback_inodes_wb(wb, &wbc);
> > +		spin_unlock(&inode_wb_list_lock);
> >  		trace_wbc_writeback_written(&wbc, wb->bdi);
> >  
> >  		bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
> > --- linux-next.orig/mm/backing-dev.c	2011-04-21 12:06:02.000000000 +0800
> > +++ linux-next/mm/backing-dev.c	2011-04-21 12:06:31.000000000 +0800
> > @@ -268,7 +268,11 @@ static void bdi_flush_io(struct backing_
> >  		.nr_to_write		= 1024,
> >  	};
> >  
> > +	spin_lock(&inode_wb_list_lock);
> > +	if (list_empty(&wb->b_io))
> > +		queue_io(wb, wbc);
> >  	writeback_inodes_wb(&bdi->wb, &wbc);
> > +	spin_unlock(&inode_wb_list_lock);
> >  }
>   Three notes here:
> 1) You are missing the call to writeback_inodes_wb() in
> balance_dirty_pages() (the patch should really work for vanilla kernels).

Good catch! I'm using an old cscope index, so I missed it.

> 2) The intention of both bdi_flush_io() and balance_dirty_pages() is to
> write .nr_to_write pages. So they should either do queue_io()
> unconditionally (I kind of like that for simplicity) or they should requeue
> once if they have not written enough - otherwise it could happen that they
> are called just at the moment when b_io contains a single inode with a few
> dirty pages and they end up doing almost nothing.

It makes much more sense to keep the policy consistent. When the
flusher and the throttled tasks are both actively manipulating the
shared lists but in different ways, how are we going to analyze the
resulting mixed behavior?

Note that bdi_flush_io() and balance_dirty_pages() both have outer
loops to retry writeout, so smallish b_io is not a problem at all.
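
[The outer loop being referred to, abridged from the mm/page-writeback.c
of that era; this is an editorial paraphrase from memory, so treat the
details as approximate:]

	for (;;) {
		struct writeback_control wbc = {
			.sync_mode		= WB_SYNC_NONE,
			.older_than_this	= NULL,
			.nr_to_write		= write_chunk,
			.range_cyclic		= 1,
		};
		/* ... recompute dirty thresholds, break once below them ... */
		if (bdi_nr_reclaimable > bdi_thresh) {
			writeback_inodes_wb(&bdi->wb, &wbc);
			pages_written += write_chunk - wbc.nr_to_write;
			if (pages_written >= write_chunk)
				break;	/* we've done our share */
		}
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(pause);
	}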

> 3) I guess your patch does not compile because queue_io() is static ;).

Yeah, good spot~ :) Here is the updated patch. I'd rather move
bdi_flush_io() to fs-writeback.c than export the low-level
queue_io() (which would enable others to conveniently change the queue policy!).

balance_dirty_pages() cannot be moved, so I plan to submit this after
the IO-less throttling work is merged. It's a cleanup patch after all.

Thanks,
Fengguang
---
Subject: writeback: move queue_io() up
Date: Thu Apr 21 12:06:32 CST 2011

Refactor code for more logical code layout.
No behavior change. 

- kill __writeback_inodes_sb()
- move bdi_flush_io() to fs-writeback.c
- elevate queue_io() and locking up to wb_writeback() and bdi_flush_io()

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   33 ++++++++++++++++++---------------
 include/linux/writeback.h |    1 +
 mm/backing-dev.c          |   12 ------------
 3 files changed, 19 insertions(+), 27 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-21 20:11:53.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-21 21:11:02.000000000 +0800
@@ -577,10 +577,6 @@ void writeback_inodes_wb(struct bdi_writ
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&wb->list_lock);
-
-	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -596,20 +592,23 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
-static void __writeback_inodes_sb(struct super_block *sb,
-		struct bdi_writeback *wb, struct writeback_control *wbc)
+void bdi_flush_io(struct backing_dev_info *bdi)
 {
-	WARN_ON(!rwsem_is_locked(&sb->s_umount));
+	struct writeback_control wbc = {
+		.sync_mode		= WB_SYNC_NONE,
+		.older_than_this	= NULL,
+		.range_cyclic		= 1,
+		.nr_to_write		= 1024,
+	};
 
-	spin_lock(&wb->list_lock);
-	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc);
-	writeback_sb_inodes(sb, wb, wbc, true);
-	spin_unlock(&wb->list_lock);
+	spin_lock(&bdi->wb.list_lock);
+	if (list_empty(&bdi->wb.b_io))
+		queue_io(&bdi->wb, &wbc);
+	writeback_inodes_wb(&bdi->wb, &wbc);
+	spin_unlock(&bdi->wb.list_lock);
 }
 
 /*
@@ -674,7 +673,7 @@ static long wb_writeback(struct bdi_writ
 	 * The intended call sequence for WB_SYNC_ALL writeback is:
 	 *
 	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
+	 *          writeback_sb_inodes()       <== called only once
 	 *              write_cache_pages()     <== called once for each inode
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
@@ -722,10 +721,14 @@ static long wb_writeback(struct bdi_writ
 
 retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
+		spin_lock(&wb->list_lock);
+		if (list_empty(&wb->b_io))
+			queue_io(wb, &wbc);
 		if (work->sb)
-			__writeback_inodes_sb(work->sb, wb, &wbc);
+			writeback_sb_inodes(work->sb, wb, &wbc, true);
 		else
 			writeback_inodes_wb(wb, &wbc);
+		spin_unlock(&wb->list_lock);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
--- linux-next.orig/mm/backing-dev.c	2011-04-21 20:11:52.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-04-21 20:16:15.000000000 +0800
@@ -260,18 +260,6 @@ int bdi_has_dirty_io(struct backing_dev_
 	return wb_has_dirty_io(&bdi->wb);
 }
 
-static void bdi_flush_io(struct backing_dev_info *bdi)
-{
-	struct writeback_control wbc = {
-		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
-		.range_cyclic		= 1,
-		.nr_to_write		= 1024,
-	};
-
-	writeback_inodes_wb(&bdi->wb, &wbc);
-}
-
 /*
  * kupdated() used to do this. We cannot do it from the bdi_forker_thread()
  * or we risk deadlocking on ->s_umount. The longer term solution would be
--- linux-next.orig/include/linux/writeback.h	2011-04-21 20:20:20.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-04-21 21:10:29.000000000 +0800
@@ -56,6 +56,7 @@ struct writeback_control {
  */	
 struct bdi_writeback;
 int inode_wait(void *);
+void bdi_flush_io(struct backing_dev_info *bdi);
 void writeback_inodes_sb(struct super_block *);
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr);
 int writeback_inodes_sb_if_idle(struct super_block *);

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21 16:41                     ` Jan Kara
@ 2011-04-22  2:32                       ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-22  2:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Fri, Apr 22, 2011 at 12:41:54AM +0800, Jan Kara wrote:
> On Thu 21-04-11 14:05:56, Wu Fengguang wrote:
> > On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> > > On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > > > I collected the writeback_single_inode() traces (patch attached for
> > > > your reference) each for several test runs, and find much more
> > > > I_DIRTY_PAGES after patchset. Dave, do you know why there are so many
> > > > I_DIRTY_PAGES (or radix tag) remained after the XFS ->writepages() call,
> > > > even for small files?
> > > 
> > > What is your definition of a small file?  As soon as it has multiple
> > > extents or holes there's absolutely no way to clean it with a single
> > > writepage call.
> > 
> > It's writing a kernel source tree to XFS. You can find in the below
> > trace that it often leaves more dirty pages behind (indicated by the
> I_DIRTY_PAGES flag) after writing as few as 1 page (indicated by the
> > wrote=1 field).
>   As Dave said, it's probably just a race since XFS redirties the inode on
> IO completion. So I think the inodes are just small: they have only a few
> dirty pages, so you don't have much to write, and they are written and
> redirtied before you check the I_DIRTY flags. You could use the radix tree
> dirty tag to verify whether there are really dirty pages or not...

Yeah, Dave and Christoph root-caused it in the other email -- XFS sets
I_DIRTY, which accidentally sets I_DIRTY_PAGES. We can safely bet there
are no real dirty pages -- otherwise it would have shown up as a
performance regression.
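
[For context, the flag relationship being referred to; this is the stock
definition from include/linux/fs.h, where I_DIRTY is the union of all
three dirty bits, so marking an inode with I_DIRTY sets I_DIRTY_PAGES as
a side effect:]

	#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)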

>   BTW a quick check of kernel tree shows the following distribution of
> sizes (in KB):
>   Count KB  Cumulative Percent
>     257 0   0.9%
>   13309 4   45%
>    5553 8   63%
>    2997 12  73%
>    1879 16  80%
>    1275 20  83%
>     987 24  87%
>     685 28  89%
>     540 32  91%
>     387 36  ...
>     309 40
>     264 44
>     249 48
>     170 52
>     143 56
>     144 60
>     132 64
>     100 68
>     ...
> Total 30155
> 
> And the distribution of your 'wrote=xxx' roughly corresponds to this...

Nice numbers! How did you manage to count them? :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-22  2:24                           ` Wu Fengguang
@ 2011-04-22 21:12                             ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-22 21:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Dave Chinner, Andrew Morton, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Fri 22-04-11 10:24:59, Wu Fengguang wrote:
> > 2) The intention of both bdi_flush_io() and balance_dirty_pages() is to
> > write .nr_to_write pages. So they should either do queue_io()
> > unconditionally (I kind of like that for simplicity) or they should requeue
> > once if they have not written enough - otherwise it could happen that they
> > are called just at the moment when b_io contains a single inode with a few
> > dirty pages and they end up doing almost nothing.
> 
> It makes much more sense to keep the policy consistent. When the
> flusher and the throttled tasks are both actively manipulating the
> shared lists but in different ways, how are we going to analyze the
> resulting mixed behavior?
> 
> Note that bdi_flush_io() and balance_dirty_pages() both have outer
> loops to retry writeout, so smallish b_io is not a problem at all.
  Well, it changes how balance_dirty_pages() behaves in some corner cases
(I'm not that much concerned about bdi_flush_io() because that is a
last-resort thing anyway). But I see your point about consistency as well.

> > 3) I guess your patch does not compile because queue_io() is static ;).
> 
> Yeah, good spot~ :) Here is the updated patch. I'd rather move
> bdi_flush_io() to fs-writeback.c than export the low-level
> queue_io() (which would enable others to conveniently change the queue policy!).
> 
> balance_dirty_pages() cannot be moved, so I plan to submit this after
> the IO-less throttling work is merged. It's a cleanup patch after all.
Can't we just have a wrapper in fs/fs-writeback.c that will do:
     spin_lock(&bdi->wb.list_lock);
     if (list_empty(&bdi->wb.b_io))
             queue_io(&bdi->wb, &wbc);
     writeback_inodes_wb(&bdi->wb, &wbc);
     spin_unlock(&bdi->wb.list_lock);

And call it wherever we need? We can then also unexport
writeback_inodes_wb() which is not really a function someone would want to
call externally after your changes.

								Honza
> ---
> Subject: writeback: move queue_io() up
> Date: Thu Apr 21 12:06:32 CST 2011
> 
> Refactor code for more logical code layout.
> No behavior change. 
> 
> - kill __writeback_inodes_sb()
> - move bdi_flush_io() to fs-writeback.c
> - elevate queue_io() and locking up to wb_writeback() and bdi_flush_io()
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c         |   33 ++++++++++++++++++---------------
>  include/linux/writeback.h |    1 +
>  mm/backing-dev.c          |   12 ------------
>  3 files changed, 19 insertions(+), 27 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-04-21 20:11:53.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-04-21 21:11:02.000000000 +0800
> @@ -577,10 +577,6 @@ void writeback_inodes_wb(struct bdi_writ
>  
>  	if (!wbc->wb_start)
>  		wbc->wb_start = jiffies; /* livelock avoidance */
> -	spin_lock(&wb->list_lock);
> -
> -	if (list_empty(&wb->b_io))
> -		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -596,20 +592,23 @@ void writeback_inodes_wb(struct bdi_writ
>  		if (ret)
>  			break;
>  	}
> -	spin_unlock(&wb->list_lock);
>  	/* Leave any unwritten inodes on b_io */
>  }
>  
> -static void __writeback_inodes_sb(struct super_block *sb,
> -		struct bdi_writeback *wb, struct writeback_control *wbc)
> +void bdi_flush_io(struct backing_dev_info *bdi)
>  {
> -	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> +	struct writeback_control wbc = {
> +		.sync_mode		= WB_SYNC_NONE,
> +		.older_than_this	= NULL,
> +		.range_cyclic		= 1,
> +		.nr_to_write		= 1024,
> +	};
>  
> -	spin_lock(&wb->list_lock);
> -	if (list_empty(&wb->b_io))
> -		queue_io(wb, wbc);
> -	writeback_sb_inodes(sb, wb, wbc, true);
> -	spin_unlock(&wb->list_lock);
> +	spin_lock(&bdi->wb.list_lock);
> +	if (list_empty(&bdi->wb.b_io))
> +		queue_io(&bdi->wb, &wbc);
> +	writeback_inodes_wb(&bdi->wb, &wbc);
> +	spin_unlock(&bdi->wb.list_lock);
>  }
>  
>  /*
> @@ -674,7 +673,7 @@ static long wb_writeback(struct bdi_writ
>  	 * The intended call sequence for WB_SYNC_ALL writeback is:
>  	 *
>  	 *      wb_writeback()
> -	 *          __writeback_inodes_sb()     <== called only once
> +	 *          writeback_sb_inodes()       <== called only once
>  	 *              write_cache_pages()     <== called once for each inode
>  	 *                   (quickly) tag currently dirty pages
>  	 *                   (maybe slowly) sync all tagged pages
> @@ -722,10 +721,14 @@ static long wb_writeback(struct bdi_writ
>  
>  retry:
>  		trace_wbc_writeback_start(&wbc, wb->bdi);
> +		spin_lock(&wb->list_lock);
> +		if (list_empty(&wb->b_io))
> +			queue_io(wb, &wbc);
>  		if (work->sb)
> -			__writeback_inodes_sb(work->sb, wb, &wbc);
> +			writeback_sb_inodes(work->sb, wb, &wbc, true);
>  		else
>  			writeback_inodes_wb(wb, &wbc);
> +		spin_unlock(&wb->list_lock);
>  		trace_wbc_writeback_written(&wbc, wb->bdi);
>  
>  		work->nr_pages -= write_chunk - wbc.nr_to_write;
> --- linux-next.orig/mm/backing-dev.c	2011-04-21 20:11:52.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-04-21 20:16:15.000000000 +0800
> @@ -260,18 +260,6 @@ int bdi_has_dirty_io(struct backing_dev_
>  	return wb_has_dirty_io(&bdi->wb);
>  }
>  
> -static void bdi_flush_io(struct backing_dev_info *bdi)
> -{
> -	struct writeback_control wbc = {
> -		.sync_mode		= WB_SYNC_NONE,
> -		.older_than_this	= NULL,
> -		.range_cyclic		= 1,
> -		.nr_to_write		= 1024,
> -	};
> -
> -	writeback_inodes_wb(&bdi->wb, &wbc);
> -}
> -
>  /*
>   * kupdated() used to do this. We cannot do it from the bdi_forker_thread()
>   * or we risk deadlocking on ->s_umount. The longer term solution would be
> --- linux-next.orig/include/linux/writeback.h	2011-04-21 20:20:20.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-04-21 21:10:29.000000000 +0800
> @@ -56,6 +56,7 @@ struct writeback_control {
>   */	
>  struct bdi_writeback;
>  int inode_wait(void *);
> +void bdi_flush_io(struct backing_dev_info *bdi);
>  void writeback_inodes_sb(struct super_block *);
>  void writeback_inodes_sb_nr(struct super_block *, unsigned long nr);
>  int writeback_inodes_sb_if_idle(struct super_block *);
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-22  2:32                       ` Wu Fengguang
@ 2011-04-22 21:23                         ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-22 21:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, Andrew Morton, Mel Gorman,
	Dave Chinner, Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Fri 22-04-11 10:32:26, Wu Fengguang wrote:
> On Fri, Apr 22, 2011 at 12:41:54AM +0800, Jan Kara wrote:
> > On Thu 21-04-11 14:05:56, Wu Fengguang wrote:
> > > On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> > > > On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > > > > I collected the writeback_single_inode() traces (patch attached for
> > > > > your reference) each for several test runs, and find much more
> > > > > I_DIRTY_PAGES after patchset. Dave, do you know why there are so many
> > > > > I_DIRTY_PAGES (or radix tag) remained after the XFS ->writepages() call,
> > > > > even for small files?
> > > > 
> > > > What is your definition of a small file?  As soon as it has multiple
> > > > extents or holes there's absolutely no way to clean it with a single
> > > > writepage call.
> > > 
> > > It's writing a kernel source tree to XFS. You can find in the below
> > > trace that it often leaves more dirty pages behind (indicated by the
> > I_DIRTY_PAGES flag) after writing as few as 1 page (indicated by the
> > > wrote=1 field).
> >   As Dave said, it's probably just a race since XFS redirties the inode on
> > IO completion. So I think the inodes are just small: they have only a few
> > dirty pages, so you don't have much to write, and they are written and
> > redirtied before you check the I_DIRTY flags. You could use the radix tree
> > dirty tag to verify whether there are really dirty pages or not...
> 
> Yeah, Dave and Christoph root-caused it in the other email -- XFS sets
> I_DIRTY, which accidentally sets I_DIRTY_PAGES. We can safely bet there
> are no real dirty pages -- otherwise it would have shown up as a
> performance regression.
  Yes, but then the question of what we actually do better is still open,
right? :) I'm really curious what it could be, because especially in your
copy-kernel case it should not make much difference - except maybe if we
occasionally managed to block on the page lock behind the writing thread
and now we don't because we queue the inode later, but I find that highly
unlikely.

> >   BTW a quick check of kernel tree shows the following distribution of
> > sizes (in KB):
> >   Count KB  Cumulative Percent
> >     257 0   0.9%
> >   13309 4   45%
> >    5553 8   63%
> >    2997 12  73%
> >    1879 16  80%
> >    1275 20  83%
> >     987 24  87%
> >     685 28  89%
> >     540 32  91%
> >     387 36  ...
> >     309 40
> >     264 44
> >     249 48
> >     170 52
> >     143 56
> >     144 60
> >     132 64
> >     100 68
> >     ...
> > Total 30155
> > 
> > And the distribution of your 'wrote=xxx' roughly corresponds to this...
> 
> Nice numbers! How did you manage to count them? :)
  Easy shell command (and I hand-computed the percentages because I was too
lazy to write a script for that):
find . -type f -name "*.[ch]" -exec du {} \; | cut -d '	' -f 1 |
sort -n | uniq -c

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-22 21:12                             ` Jan Kara
@ 2011-04-26  5:37                               ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-26  5:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Andrew Morton, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Sat, Apr 23, 2011 at 05:12:55AM +0800, Jan Kara wrote:
> On Fri 22-04-11 10:24:59, Wu Fengguang wrote:
> > > 2) The intention of both bdi_flush_io() and balance_dirty_pages() is to
> > > write .nr_to_write pages. So they should either do queue_io()
> > > unconditionally (I kind of like that for simplicity) or they should requeue
> > > once if they have not written enough - otherwise it could happen that they
> > > are called just at the moment when b_io contains a single inode with a few
> > > dirty pages and they end up doing almost nothing.
> > 
> > It makes much more sense to keep the policy consistent. When the
> > flusher and the throttled tasks are both actively manipulating the
> > shared lists but in different ways, how are we going to analyze the
> > resulting mixed behavior?
> > 
> > Note that bdi_flush_io() and balance_dirty_pages() both have outer
> > loops to retry writeout, so smallish b_io is not a problem at all.
>   Well, it changes how balance_dirty_pages() behaves in some corner cases
> (I'm not that much concerned about bdi_flush_io() because that is a
> last-resort thing anyway). But I see your point about consistency as well.
> 
> > > 3) I guess your patch does not compile because queue_io() is static ;).
> > 
> > Yeah, good spot~ :) Here is the updated patch. I'd rather move
> > bdi_flush_io() to fs-writeback.c than export the low-level
> > queue_io() (which would enable others to conveniently change the queue policy!).
> > 
> > balance_dirty_pages() cannot be moved, so I plan to submit this after
> > the IO-less throttling work is merged. It's a cleanup patch after all.
> Can't we just have a wrapper in fs/fs-writeback.c that will do:
>      spin_lock(&bdi->wb.list_lock);
>      if (list_empty(&bdi->wb.b_io))
>              queue_io(&bdi->wb, &wbc);
>      writeback_inodes_wb(&bdi->wb, &wbc);
>      spin_unlock(&bdi->wb.list_lock);
> 
> And call it wherever we need? We can then also unexport
> writeback_inodes_wb() which is not really a function someone would want to
> call externally after your changes.

OK, this avoids the need to move bdi_flush_io(). Here is the updated
patch; do you see any more problems?

Thanks,
Fengguang
---
Subject: writeback: elevate queue_io() into wb_writeback()
Date: Thu Apr 21 12:06:32 CST 2011

Code refactor for more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
  before calling __writeback_inodes_wb()

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-26 13:20:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-26 13:30:19.000000000 +0800
@@ -570,17 +570,13 @@ static int writeback_sb_inodes(struct su
 	return 1;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+static void __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct writeback_control *wbc)
 {
 	int ret = 0;
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&wb->list_lock);
-
-	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -596,19 +592,16 @@ void writeback_inodes_wb(struct bdi_writ
 		if (ret)
 			break;
 	}
-	spin_unlock(&wb->list_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
-static void __writeback_inodes_sb(struct super_block *sb,
-		struct bdi_writeback *wb, struct writeback_control *wbc)
+void writeback_inodes_wb(struct bdi_writeback *wb,
+		struct writeback_control *wbc)
 {
-	WARN_ON(!rwsem_is_locked(&sb->s_umount));
-
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc);
-	writeback_sb_inodes(sb, wb, wbc, true);
+	__writeback_inodes_wb(wb, wbc);
 	spin_unlock(&wb->list_lock);
 }
 
@@ -674,7 +667,7 @@ static long wb_writeback(struct bdi_writ
 	 * The intended call sequence for WB_SYNC_ALL writeback is:
 	 *
 	 *      wb_writeback()
-	 *          __writeback_inodes_sb()     <== called only once
+	 *          writeback_sb_inodes()       <== called only once
 	 *              write_cache_pages()     <== called once for each inode
 	 *                   (quickly) tag currently dirty pages
 	 *                   (maybe slowly) sync all tagged pages
@@ -722,10 +715,14 @@ static long wb_writeback(struct bdi_writ
 
 retry:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
+		spin_lock(&wb->list_lock);
+		if (list_empty(&wb->b_io))
+			queue_io(wb, &wbc);
 		if (work->sb)
-			__writeback_inodes_sb(work->sb, wb, &wbc);
+			writeback_sb_inodes(work->sb, wb, &wbc, true);
 		else
-			writeback_inodes_wb(wb, &wbc);
+			__writeback_inodes_wb(wb, &wbc);
+		spin_unlock(&wb->list_lock);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback
  2011-04-26  5:37                               ` Wu Fengguang
@ 2011-04-26 14:30                                 ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2011-04-26 14:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Dave Chinner, Andrew Morton, Mel Gorman, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel, Linux Memory Management List

On Tue 26-04-11 13:37:06, Wu Fengguang wrote:
> On Sat, Apr 23, 2011 at 05:12:55AM +0800, Jan Kara wrote:
> > On Fri 22-04-11 10:24:59, Wu Fengguang wrote:
> > > > 2) The intention of both bdi_flush_io() and balance_dirty_pages() is to
> > > > write .nr_to_write pages. So they should either do queue_io()
> > > > unconditionally (I kind of like that for simplicity) or they should requeue
> > > > once if they have not written enough - otherwise it could happen that they
> > > > are called just at the moment when b_io contains a single inode with a few
> > > > dirty pages and they end up doing almost nothing.
> > > 
> > > It makes much more sense to keep the policy consistent. When the
> > > flusher and the throttled tasks are both actively manipulating the
> > > shared lists but in different ways, how are we going to analyze the
> > > resulting mixed behavior?
> > > 
> > > Note that bdi_flush_io() and balance_dirty_pages() both have outer
> > > loops to retry writeout, so smallish b_io is not a problem at all.
> >   Well, it changes how balance_dirty_pages() behaves in some corner cases
> > (I'm not that much concerned about bdi_flush_io() because that is a last
> > resort thing anyway). But I see your point in consistency as well.
> > 
> > > > 3) I guess your patch does not compile because queue_io() is static ;).
> > > 
> > > Yeah, good spot~ :) Here is the updated patch. I feel like moving
> > > bdi_flush_io() to fs-writeback.c rather than exporting the low-level
> > > queue_io() (which would let others conveniently change the queueing policy!).
> > > 
> > > balance_dirty_pages() cannot be moved, so I plan to submit it after
> > > any IO-less merges. It's a cleanup patch after all.
> > Can't we just have a wrapper in fs/fs-writeback.c that will do:
> >      spin_lock(&bdi->wb.list_lock);
> >      if (list_empty(&bdi->wb.b_io))
> >              queue_io(&bdi->wb, &wbc);
> >      writeback_inodes_wb(&bdi->wb, &wbc);
> >      spin_unlock(&bdi->wb.list_lock);
> > 
> > And call it wherever we need? We can then also unexport
> > writeback_inodes_wb() which is not really a function someone would want to
> > call externally after your changes.
> 
> OK, this avoids the need to move bdi_flush_io(). Here is the updated
> patch; do you see any more problems?
  Yes, with this patch I think your change to the queueing logic is OK.
Thanks.

								Honza
> 
> Thanks,
> Fengguang
> ---
> Subject: writeback: elevate queue_io() into wb_writeback()
> Date: Thu Apr 21 12:06:32 CST 2011
> 
> Code refactoring for a more logical layout.
> No behavior change.
> 
> - remove the mis-named __writeback_inodes_sb()
> 
> - wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
>   before calling __writeback_inodes_wb()
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   27 ++++++++++++---------------
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-04-26 13:20:17.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-04-26 13:30:19.000000000 +0800
> @@ -570,17 +570,13 @@ static int writeback_sb_inodes(struct su
>  	return 1;
>  }
>  
> -void writeback_inodes_wb(struct bdi_writeback *wb,
> -		struct writeback_control *wbc)
> +static void __writeback_inodes_wb(struct bdi_writeback *wb,
> +				  struct writeback_control *wbc)
>  {
>  	int ret = 0;
>  
>  	if (!wbc->wb_start)
>  		wbc->wb_start = jiffies; /* livelock avoidance */
> -	spin_lock(&wb->list_lock);
> -
> -	if (list_empty(&wb->b_io))
> -		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -596,19 +592,16 @@ void writeback_inodes_wb(struct bdi_writ
>  		if (ret)
>  			break;
>  	}
> -	spin_unlock(&wb->list_lock);
>  	/* Leave any unwritten inodes on b_io */
>  }
>  
> -static void __writeback_inodes_sb(struct super_block *sb,
> -		struct bdi_writeback *wb, struct writeback_control *wbc)
> +void writeback_inodes_wb(struct bdi_writeback *wb,
> +		struct writeback_control *wbc)
>  {
> -	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> -
>  	spin_lock(&wb->list_lock);
>  	if (list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
> -	writeback_sb_inodes(sb, wb, wbc, true);
> +	__writeback_inodes_wb(wb, wbc);
>  	spin_unlock(&wb->list_lock);
>  }
>  
> @@ -674,7 +667,7 @@ static long wb_writeback(struct bdi_writ
>  	 * The intended call sequence for WB_SYNC_ALL writeback is:
>  	 *
>  	 *      wb_writeback()
> -	 *          __writeback_inodes_sb()     <== called only once
> +	 *          writeback_sb_inodes()       <== called only once
>  	 *              write_cache_pages()     <== called once for each inode
>  	 *                   (quickly) tag currently dirty pages
>  	 *                   (maybe slowly) sync all tagged pages
> @@ -722,10 +715,14 @@ static long wb_writeback(struct bdi_writ
>  
>  retry:
>  		trace_wbc_writeback_start(&wbc, wb->bdi);
> +		spin_lock(&wb->list_lock);
> +		if (list_empty(&wb->b_io))
> +			queue_io(wb, &wbc);
>  		if (work->sb)
> -			__writeback_inodes_sb(work->sb, wb, &wbc);
> +			writeback_sb_inodes(work->sb, wb, &wbc, true);
>  		else
> -			writeback_inodes_wb(wb, &wbc);
> +			__writeback_inodes_wb(wb, &wbc);
> +		spin_unlock(&wb->list_lock);
>  		trace_wbc_writeback_written(&wbc, wb->bdi);
>  
>  		work->nr_pages -= write_chunk - wbc.nr_to_write;
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
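
The "requeue once" option discussed above (queue again if b_io runs dry
before nr_to_write is met) is not what the patch does; against the patched
code it might look roughly like this sketch, for illustration only:

	spin_lock(&wb->list_lock);
	if (list_empty(&wb->b_io))
		queue_io(wb, wbc);
	__writeback_inodes_wb(wb, wbc);
	/* requeue once if b_io ran dry before nr_to_write was met */
	if (wbc->nr_to_write > 0 && list_empty(&wb->b_io)) {
		queue_io(wb, wbc);
		__writeback_inodes_wb(wb, wbc);
	}
	spin_unlock(&wb->list_lock);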

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2011-05-04 11:04     ` Christoph Hellwig
@ 2011-05-04 11:13       ` Wu Fengguang
  -1 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-05-04 11:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Mel Gorman, Dave Chinner,
	Itaru Kitayama, Minchan Kim, Linux Memory Management List,
	linux-fsdevel, LKML

On Wed, May 04, 2011 at 07:04:32PM +0800, Christoph Hellwig wrote:
> On Wed, Apr 20, 2011 at 04:03:37PM +0800, Wu Fengguang wrote:
> > No behavior change. This adds debug visibility to the code: for
> > example, the wbc contents can be dumped when kprobing queue_io().
> 
> I don't think it's a good idea.  The writeback_control should move
> back to just controlling per-inode writeback and not be passed to
> more routines dealing with high-level writeback.

Good point. I can do without this patch.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2011-04-20  8:03   ` Wu Fengguang
@ 2011-05-04 11:04     ` Christoph Hellwig
  -1 siblings, 0 replies; 135+ messages in thread
From: Christoph Hellwig @ 2011-05-04 11:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Mel Gorman, Dave Chinner,
	Itaru Kitayama, Minchan Kim, Linux Memory Management List,
	linux-fsdevel, LKML

On Wed, Apr 20, 2011 at 04:03:37PM +0800, Wu Fengguang wrote:
> No behavior change. This adds debug visibility to the code: for
> example, the wbc contents can be dumped when kprobing queue_io().

I don't think it's a good idea.  The writeback_control should move
back to just controlling per-inode writeback and not be passed to
more routines dealing with high-level writeback.
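
Concretely, that direction keeps expiry as an explicit parameter on the
high-level path, i.e. the pre-patch shape that appears as '-' lines in the
patch below (sketch):

	static void queue_io(struct bdi_writeback *wb,
			     unsigned long *older_than_this)
	{
		assert_spin_locked(&inode_wb_list_lock);
		list_splice_init(&wb->b_more_io, &wb->b_io);
		move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
	}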


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2011-04-20  8:03 [PATCH 0/6] writeback: moving expire targets for background/kupdate works v2 Wu Fengguang
@ 2011-04-20  8:03   ` Wu Fengguang
  0 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2011-04-20  8:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Mel Gorman, Wu Fengguang, Dave Chinner,
	Itaru Kitayama, Minchan Kim, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-pass-wbc-to-queue_io.patch --]
[-- Type: text/plain, Size: 2554 bytes --]

No behavior change. This adds debug visibility to the code: for
example, the wbc contents can be dumped when kprobing queue_io().

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:28.000000000 +0800
@@ -251,8 +251,8 @@ static bool inode_dirtied_after(struct i
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
 static void move_expired_inodes(struct list_head *delaying_queue,
-			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+				struct list_head *dispatch_queue,
+				struct writeback_control *wbc)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
@@ -262,8 +262,8 @@ static void move_expired_inodes(struct l
 
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
-		if (older_than_this &&
-		    inode_dirtied_after(inode, *older_than_this))
+		if (wbc->older_than_this &&
+		    inode_dirtied_after(inode, *wbc->older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -299,11 +299,11 @@ static void move_expired_inodes(struct l
  *                                           |
  *                                           +--> dequeue for IO
  */
-static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
 {
 	assert_spin_locked(&inode_wb_list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -579,7 +579,7 @@ void writeback_inodes_wb(struct bdi_writ
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_wb_list_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -606,7 +606,7 @@ static void __writeback_inodes_sb(struct
 
 	spin_lock(&inode_wb_list_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_wb_list_lock);
 }
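
The kprobe use case from the changelog could be exercised with a jprobe
module along these lines (a minimal sketch for a kernel of this era; the
module and printed fields are illustrative, and it relies on the static
queue_io() being in kallsyms and not inlined):

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/writeback.h>
#include <linux/backing-dev.h>

/* entry handler mirrors queue_io()'s post-patch signature */
static void jqueue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
{
	pr_info("queue_io: nr_to_write=%ld for_kupdate=%d for_background=%d\n",
		wbc->nr_to_write, (int)wbc->for_kupdate,
		(int)wbc->for_background);
	jprobe_return();	/* mandatory: hand control back to queue_io() */
}

static struct jprobe queue_io_jp = {
	.entry = jqueue_io,
	.kp.symbol_name = "queue_io",
};

static int __init qio_probe_init(void)
{
	return register_jprobe(&queue_io_jp);
}

static void __exit qio_probe_exit(void)
{
	unregister_jprobe(&queue_io_jp);
}

module_init(qio_probe_init);
module_exit(qio_probe_exit);
MODULE_LICENSE("GPL");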



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-08-01 15:23     ` Minchan Kim
  -1 siblings, 0 replies; 135+ messages in thread
From: Minchan Kim @ 2010-08-01 15:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:29PM +0800, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-26 10:44     ` Mel Gorman
  -1 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2010-07-26 10:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:29PM +0800, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Can't see any problem.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-23 18:16     ` Jan Kara
  -1 siblings, 0 replies; 135+ messages in thread
From: Jan Kara @ 2010-07-23 18:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu 22-07-10 13:09:29, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
  Looks OK.

Acked-by: Jan Kara <jack@suse.cz>

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-21 20:12:38.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-21 20:14:38.000000000 +0800
> @@ -213,8 +213,8 @@ static bool inode_dirtied_after(struct i
>   * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
>   */
>  static void move_expired_inodes(struct list_head *delaying_queue,
> -			       struct list_head *dispatch_queue,
> -				unsigned long *older_than_this)
> +				struct list_head *dispatch_queue,
> +				struct writeback_control *wbc)
>  {
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
> @@ -224,8 +224,8 @@ static void move_expired_inodes(struct l
>  
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> -		if (older_than_this &&
> -		    inode_dirtied_after(inode, *older_than_this))
> +		if (wbc->older_than_this &&
> +		    inode_dirtied_after(inode, *wbc->older_than_this))
>  			break;
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
> @@ -257,10 +257,10 @@ static void move_expired_inodes(struct l
>   *                 => b_more_io inodes
>   *                 => remaining inodes in b_io => (dequeue for sync)
>   */
> -static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> +static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
>  {
>  	list_splice_init(&wb->b_more_io, &wb->b_io);
> -	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> +	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
>  }
>  
>  static int write_inode(struct inode *inode, struct writeback_control *wbc)
> @@ -519,7 +519,7 @@ void writeback_inodes_wb(struct bdi_writ
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
>  	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
> +		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = list_entry(wb->b_io.prev,
> @@ -548,7 +548,7 @@ static void __writeback_inodes_sb(struct
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
>  	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
> +		queue_io(wb, wbc);
>  	writeback_sb_inodes(sb, wb, wbc, true);
>  	spin_unlock(&inode_lock);
>  }
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-22  5:09   ` Wu Fengguang
  0 siblings, 0 replies; 135+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-pass-wbc-to-queue_io.patch --]
[-- Type: text/plain, Size: 2458 bytes --]

This is to prepare for moving the dirty expire policy to move_expired_inodes().
No behavior change.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-21 20:12:38.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-21 20:14:38.000000000 +0800
@@ -213,8 +213,8 @@ static bool inode_dirtied_after(struct i
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
 static void move_expired_inodes(struct list_head *delaying_queue,
-			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+				struct list_head *dispatch_queue,
+				struct writeback_control *wbc)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
@@ -224,8 +224,8 @@ static void move_expired_inodes(struct l
 
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (older_than_this &&
-		    inode_dirtied_after(inode, *older_than_this))
+		if (wbc->older_than_this &&
+		    inode_dirtied_after(inode, *wbc->older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -257,10 +257,10 @@ static void move_expired_inodes(struct l
  *                 => b_more_io inodes
  *                 => remaining inodes in b_io => (dequeue for sync)
  */
-static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
 {
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -519,7 +519,7 @@ void writeback_inodes_wb(struct bdi_writ
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
@@ -548,7 +548,7 @@ static void __writeback_inodes_sb(struct
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);
 }
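
To make the preparation concrete: once wbc is available here,
move_expired_inodes() can later compute the expire target itself, roughly
along these lines (a sketch of the stated direction, not the actual
follow-up patch; dirty_expire_interval is the existing sysctl, in
centisecs):

	unsigned long expire_interval = 0;
	unsigned long older_than_this = 0;

	if (wbc->for_kupdate) {
		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
		older_than_this = jiffies - expire_interval;
	}

	while (!list_empty(delaying_queue)) {
		inode = list_entry(delaying_queue->prev, struct inode, i_list);
		if (expire_interval &&
		    inode_dirtied_after(inode, older_than_this))
			break;
		/* sb-sort and splice to dispatch_queue as before */
	}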



^ permalink raw reply	[flat|nested] 135+ messages in thread

end of thread (newest: 2011-05-04 11:13 UTC)

Thread overview: 135+ messages
2011-04-19  3:00 [PATCH 0/6] writeback: moving expire targets for background/kupdate works Wu Fengguang
2011-04-19  3:00 ` Wu Fengguang
2011-04-19  3:00 ` Wu Fengguang
2011-04-19  3:00 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  3:00 ` [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  7:02   ` Dave Chinner
2011-04-19  7:02     ` Dave Chinner
2011-04-19  7:20     ` Wu Fengguang
2011-04-19  7:20       ` Wu Fengguang
2011-04-19  9:31       ` Jan Kara
2011-04-19  9:31         ` Jan Kara
2011-04-19  3:00 ` [PATCH 3/6] writeback: sync expired inodes first in background writeback Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  7:35   ` Dave Chinner
2011-04-19  7:35     ` Dave Chinner
2011-04-19  9:57     ` Jan Kara
2011-04-19  9:57       ` Jan Kara
2011-04-19 12:56       ` Wu Fengguang
2011-04-19 13:46         ` Wu Fengguang
2011-04-19 13:46           ` Wu Fengguang
2011-04-20  1:21         ` Dave Chinner
2011-04-20  1:21           ` Dave Chinner
2011-04-20  2:53           ` Wu Fengguang
2011-04-20  2:53             ` Wu Fengguang
2011-04-21  0:45             ` Dave Chinner
2011-04-21  0:45               ` Dave Chinner
2011-04-21  2:06               ` Wu Fengguang
2011-04-21  2:06                 ` Wu Fengguang
2011-04-21  3:01                 ` Dave Chinner
2011-04-21  3:01                   ` Dave Chinner
2011-04-21  3:59                   ` Wu Fengguang
2011-04-21  3:59                     ` Wu Fengguang
2011-04-21  4:10                     ` Wu Fengguang
2011-04-21  4:10                       ` Wu Fengguang
2011-04-21  4:36                       ` Christoph Hellwig
2011-04-21  4:36                         ` Christoph Hellwig
2011-04-21  6:36                       ` Dave Chinner
2011-04-21  6:36                         ` Dave Chinner
2011-04-21 16:04                       ` Jan Kara
2011-04-21 16:04                         ` Jan Kara
2011-04-22  2:24                         ` Wu Fengguang
2011-04-22  2:24                           ` Wu Fengguang
2011-04-22 21:12                           ` Jan Kara
2011-04-22 21:12                             ` Jan Kara
2011-04-26  5:37                             ` Wu Fengguang
2011-04-26  5:37                               ` Wu Fengguang
2011-04-26 14:30                               ` Jan Kara
2011-04-26 14:30                                 ` Jan Kara
2011-04-20  7:38           ` Wu Fengguang
2011-04-20  7:38             ` Wu Fengguang
2011-04-21  1:01             ` Dave Chinner
2011-04-21  1:01               ` Dave Chinner
2011-04-21  1:47               ` Wu Fengguang
2011-04-21  1:47                 ` Wu Fengguang
2011-04-19  3:00 ` [PATCH 4/6] writeback: introduce writeback_control.inodes_cleaned Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  9:47   ` Jan Kara
2011-04-19  9:47     ` Jan Kara
2011-04-19  3:00 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19 10:20   ` Jan Kara
2011-04-19 10:20     ` Jan Kara
2011-04-19 11:16     ` Wu Fengguang
2011-04-19 11:16       ` Wu Fengguang
2011-04-19 21:10       ` Jan Kara
2011-04-19 21:10         ` Jan Kara
2011-04-20  7:50         ` Wu Fengguang
2011-04-20  7:50           ` Wu Fengguang
2011-04-20 15:22           ` Jan Kara
2011-04-20 15:22             ` Jan Kara
2011-04-21  3:33             ` Wu Fengguang
2011-04-21  4:39               ` Christoph Hellwig
2011-04-21  4:39                 ` Christoph Hellwig
2011-04-21  6:05                 ` Wu Fengguang
2011-04-21  6:05                   ` Wu Fengguang
2011-04-21 16:41                   ` Jan Kara
2011-04-21 16:41                     ` Jan Kara
2011-04-22  2:32                     ` Wu Fengguang
2011-04-22  2:32                       ` Wu Fengguang
2011-04-22 21:23                       ` Jan Kara
2011-04-22 21:23                         ` Jan Kara
2011-04-21  7:09               ` Dave Chinner
2011-04-21  7:09                 ` Dave Chinner
2011-04-21  7:14                 ` Christoph Hellwig
2011-04-21  7:14                   ` Christoph Hellwig
2011-04-21  7:52                   ` Dave Chinner
2011-04-21  7:52                     ` Dave Chinner
2011-04-21  8:00                     ` Christoph Hellwig
2011-04-21  8:00                       ` Christoph Hellwig
2011-04-19  3:00 ` [PATCH 6/6] NFS: return -EAGAIN when skipped commit in nfs_commit_unstable_pages() Wu Fengguang
2011-04-19  3:00   ` Wu Fengguang
2011-04-19  3:29   ` Trond Myklebust
2011-04-19  3:29     ` Trond Myklebust
2011-04-19  3:55     ` Wu Fengguang
2011-04-19  3:55       ` Wu Fengguang
2011-04-21  4:40   ` Christoph Hellwig
2011-04-21  4:40     ` Christoph Hellwig
2011-04-19  6:38 ` [PATCH 0/6] writeback: moving expire targets for background/kupdate works Dave Chinner
2011-04-19  6:38   ` Dave Chinner
2011-04-19  8:02   ` Wu Fengguang
2011-04-19  8:02     ` Wu Fengguang
2011-04-21  4:34 ` Christoph Hellwig
2011-04-21  4:34   ` Christoph Hellwig
2011-04-21  5:50   ` Wu Fengguang
2011-04-21  5:50     ` Wu Fengguang
2011-04-21  5:56     ` Christoph Hellwig
2011-04-21  5:56       ` Christoph Hellwig
2011-04-21  6:07       ` Wu Fengguang
2011-04-21  6:07         ` Wu Fengguang
2011-04-21  7:17         ` Christoph Hellwig
2011-04-21  7:17           ` Christoph Hellwig
2011-04-21 10:15           ` Wu Fengguang
2011-04-21 10:15             ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2011-04-20  8:03 [PATCH 0/6] writeback: moving expire targets for background/kupdate works v2 Wu Fengguang
2011-04-20  8:03 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
2011-04-20  8:03   ` Wu Fengguang
2011-05-04 11:04   ` Christoph Hellwig
2011-05-04 11:04     ` Christoph Hellwig
2011-05-04 11:13     ` Wu Fengguang
2011-05-04 11:13       ` Wu Fengguang
2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-23 18:16   ` Jan Kara
2010-07-23 18:16     ` Jan Kara
2010-07-26 10:44   ` Mel Gorman
2010-07-26 10:44     ` Mel Gorman
2010-08-01 15:23   ` Minchan Kim
2010-08-01 15:23     ` Minchan Kim
