* [PATCH 0/6] [RFC] writeback: try to write older pages first
@ 2010-07-22  5:09 ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Chris Mason,
	Jens Axboe, Wu Fengguang, LKML, linux-fsdevel, linux-mm

Andrew,

The basic way to avoid pageout() is to make the flusher sync inodes in the
right order: the oldest dirty inodes contain the oldest pages. The smaller the
inode, the stronger the correlation between the inode's dirty time and its
pages' dirty times. So for small dirty inodes, syncing in order of inode dirty
time is enough to avoid pageout(). If pageout() is still triggered frequently
in this case, the 30s dirty expire time may be too long and could be shrunk
adaptively; or it may be a stressed memcg list whose dirty inodes/pages are
harder to track.

For a large dirty inode, the flusher may write lots of newly dirtied pages
_after_ syncing the expired ones. This is the normal case for a single-stream
sequential dirtier, where older pages sit at lower offsets. In this case we
should not insist on syncing the whole large dirty inode before considering
the other small dirty inodes: that risks wasting time on 1GB of freshly
dirtied pages while N*1MB of expired dirty pages approach the end of the LRU
list and hence pageout().

For a large dirty inode, the flusher may also write lots of newly dirtied
pages _before_ hitting the desired old ones, in which case it helps to have
pageout() do some clustered writeback, and/or to set
mapping->writeback_index so the flusher focuses on the old pages.

For a large dirty inode, old and new dirty pages may also be intermixed. In
this case we need to make sure the inode is queued for IO before some of its
pages hit pageout(). An adaptive dirty expire time helps here.

OK, end of the vapour ideas. As for this patchset, it fixes the current
kupdate/background writeback priorities:

- kupdate/background writeback shall include newly expired inodes at each
  queue_io() time, as the large inodes left over from previous writeback
  rounds are likely to have a lower density of old pages.

- background writeback shall consider expired inodes first, just like the
  kupdate writeback does.
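The queueing policy above can be sketched in userspace C (a minimal
illustration, not kernel code: queue_expired() and struct fake_inode are
hypothetical stand-ins for move_expired_inodes() and struct inode, with the
kernel's list_head handling reduced to arrays):

```c
/*
 * Sketch of the fixed policy: at every queue_io() the flusher pulls in
 * all inodes whose dirty timestamp has newly expired against a cutoff
 * recomputed from the current time, instead of reusing a cutoff fixed
 * at the start of the writeback work.
 */
#include <assert.h>

#define MAX_INODES 8

struct fake_inode {
	int id;
	unsigned long dirtied_when;	/* fake "jiffies" */
};

/*
 * Move expired inodes (dirtied at or before now - expire_interval) from
 * @dirty to @io, preserving dirty-time order. Returns the new @io count;
 * *ndirty is updated in place.
 */
static int queue_expired(struct fake_inode *dirty, int *ndirty,
			 struct fake_inode *io, int nio,
			 unsigned long now, unsigned long expire_interval)
{
	unsigned long older_than_this = now - expire_interval;
	int i, kept = 0;

	for (i = 0; i < *ndirty; i++) {
		if (dirty[i].dirtied_when <= older_than_this)
			io[nio++] = dirty[i];	/* newly expired: queue for IO */
		else
			dirty[kept++] = dirty[i];	/* still fresh: stay dirty */
	}
	*ndirty = kept;
	return nio;
}
```

With an expire interval of 30: an inode dirtied at t=10 is queued at t=100,
while one dirtied at t=90 stays on the dirty list; a second queue_io() pass
at t=130 picks it up, since the cutoff has moved forward with the clock.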

Thanks,
Fengguang



* [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09 ` Wu Fengguang
@ 2010-07-22  5:09   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-pass-wbc-to-queue_io.patch --]
[-- Type: text/plain, Size: 2458 bytes --]

This is to prepare for moving the dirty expire policy to move_expired_inodes().
No behavior change.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-21 20:12:38.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-21 20:14:38.000000000 +0800
@@ -213,8 +213,8 @@ static bool inode_dirtied_after(struct i
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
 static void move_expired_inodes(struct list_head *delaying_queue,
-			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+				struct list_head *dispatch_queue,
+				struct writeback_control *wbc)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
@@ -224,8 +224,8 @@ static void move_expired_inodes(struct l
 
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (older_than_this &&
-		    inode_dirtied_after(inode, *older_than_this))
+		if (wbc->older_than_this &&
+		    inode_dirtied_after(inode, *wbc->older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -257,10 +257,10 @@ static void move_expired_inodes(struct l
  *                 => b_more_io inodes
  *                 => remaining inodes in b_io => (dequeue for sync)
  */
-static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
 {
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -519,7 +519,7 @@ void writeback_inodes_wb(struct bdi_writ
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
@@ -548,7 +548,7 @@ static void __writeback_inodes_sb(struct
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);
 }




* [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-22  5:09 ` Wu Fengguang
@ 2010-07-22  5:09   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Jan Kara, Wu Fengguang, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

[-- Attachment #1: writeback-remove-older_than_this.patch --]
[-- Type: text/plain, Size: 6245 bytes --]

Dynamically compute the dirty expire timestamp at queue_io() time, and
remove writeback_control.older_than_this, which is no longer used.

writeback_control.older_than_this used to be determined once, at entry to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may come to contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks pushing
  the expired dirty pages to the end of the LRU lists, triggering the very
  bad pageout(). Nevertheless this patch merely addresses part of the
  problem.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   24 +++++++++---------------
 include/linux/writeback.h        |    2 --
 include/trace/events/writeback.h |    6 +-----
 mm/backing-dev.c                 |    1 -
 mm/page-writeback.c              |    1 -
 5 files changed, 10 insertions(+), 24 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-21 22:20:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
@@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
 				struct list_head *dispatch_queue,
 				struct writeback_control *wbc)
 {
+	unsigned long expire_interval = 0;
+	unsigned long older_than_this;
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	if (wbc->for_kupdate) {
+		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
+		older_than_this = jiffies - expire_interval;
+	}
+
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (wbc->older_than_this &&
-		    inode_dirtied_after(inode, *wbc->older_than_this))
+		if (expire_interval &&
+		    inode_dirtied_after(inode, older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -583,29 +590,19 @@ static inline bool over_bground_thresh(v
  * Try to run once per dirty_writeback_interval.  But if a writeback event
  * takes longer than a dirty_writeback_interval interval, then leave a
  * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
  */
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
-		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.range_cyclic		= work->range_cyclic,
 	};
-	unsigned long oldest_jif;
 	long wrote = 0;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -998,9 +995,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  * Write out a superblock's list of dirty inodes.  A wait will be performed
  * upon no inodes, all inodes or the final one, depending upon sync_mode.
  *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
  * If `bdi' is non-zero then we're being asked to writeback a specific queue.
  * This function assumes that the blockdev superblock's inodes are backed by
  * a variety of queues, so all inodes are searched.  For other superblocks,
--- linux-next.orig/include/linux/writeback.h	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
@@ -28,8 +28,6 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	unsigned long wb_start;         /* Time writeback_inodes_wb was
 					   called. This is needed to avoid
 					   extra jobs and livelock */
--- linux-next.orig/include/trace/events/writeback.h	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
@@ -100,7 +100,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
 		__field(int, more_io)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
 		__entry->more_io	= wbc->more_io;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d more=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -133,7 +130,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
 		__entry->more_io,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/mm/page-writeback.c	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-21 22:20:03.000000000 +0800
@@ -482,7 +482,6 @@ static void balance_dirty_pages(struct a
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
 			.range_cyclic	= 1,
 		};
--- linux-next.orig/mm/backing-dev.c	2010-07-22 11:23:34.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-07-22 11:23:39.000000000 +0800
@@ -271,7 +271,6 @@ static void bdi_flush_io(struct backing_
 {
 	struct writeback_control wbc = {
 		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 		.nr_to_write		= 1024,
 	};



^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
@ 2010-07-22  5:09   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Jan Kara, Wu Fengguang, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

[-- Attachment #1: writeback-remove-older_than_this.patch --]
[-- Type: text/plain, Size: 6470 bytes --]

Dynamicly compute the dirty expire timestamp at queue_io() time.
Also remove writeback_control.older_than_this which is no longer used.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  delaying the expired dirty pages to the end of LRU lists, triggering
  the very bad pageout(). Neverthless this patch merely addresses part
  of the problem.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   24 +++++++++---------------
 include/linux/writeback.h        |    2 --
 include/trace/events/writeback.h |    6 +-----
 mm/backing-dev.c                 |    1 -
 mm/page-writeback.c              |    1 -
 5 files changed, 10 insertions(+), 24 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-21 22:20:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
@@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
 				struct list_head *dispatch_queue,
 				struct writeback_control *wbc)
 {
+	unsigned long expire_interval = 0;
+	unsigned long older_than_this;
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	if (wbc->for_kupdate) {
+		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
+		older_than_this = jiffies - expire_interval;
+	}
+
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (wbc->older_than_this &&
-		    inode_dirtied_after(inode, *wbc->older_than_this))
+		if (expire_interval &&
+		    inode_dirtied_after(inode, older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -583,29 +590,19 @@ static inline bool over_bground_thresh(v
  * Try to run once per dirty_writeback_interval.  But if a writeback event
  * takes longer than a dirty_writeback_interval interval, then leave a
  * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
  */
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
-		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.range_cyclic		= work->range_cyclic,
 	};
-	unsigned long oldest_jif;
 	long wrote = 0;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -998,9 +995,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  * Write out a superblock's list of dirty inodes.  A wait will be performed
  * upon no inodes, all inodes or the final one, depending upon sync_mode.
  *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
  * If `bdi' is non-zero then we're being asked to writeback a specific queue.
  * This function assumes that the blockdev superblock's inodes are backed by
  * a variety of queues, so all inodes are searched.  For other superblocks,
--- linux-next.orig/include/linux/writeback.h	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
@@ -28,8 +28,6 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	unsigned long wb_start;         /* Time writeback_inodes_wb was
 					   called. This is needed to avoid
 					   extra jobs and livelock */
--- linux-next.orig/include/trace/events/writeback.h	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
@@ -100,7 +100,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
 		__field(int, more_io)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
 		__entry->more_io	= wbc->more_io;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d more=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -133,7 +130,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
 		__entry->more_io,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/mm/page-writeback.c	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-21 22:20:03.000000000 +0800
@@ -482,7 +482,6 @@ static void balance_dirty_pages(struct a
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
 			.range_cyclic	= 1,
 		};
--- linux-next.orig/mm/backing-dev.c	2010-07-22 11:23:34.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-07-22 11:23:39.000000000 +0800
@@ -271,7 +271,6 @@ static void bdi_flush_io(struct backing_
 {
 	struct writeback_control wbc = {
 		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 		.nr_to_write		= 1024,
 	};


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
@ 2010-07-22  5:09   ` Wu Fengguang
  0 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Jan Kara, Wu Fengguang, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

[-- Attachment #1: writeback-remove-older_than_this.patch --]
[-- Type: text/plain, Size: 6470 bytes --]

Dynamicly compute the dirty expire timestamp at queue_io() time.
Also remove writeback_control.older_than_this which is no longer used.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  delaying the expired dirty pages to the end of the LRU lists, triggering
  the very bad pageout(). Nevertheless this patch merely addresses part
  of the problem.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   24 +++++++++---------------
 include/linux/writeback.h        |    2 --
 include/trace/events/writeback.h |    6 +-----
 mm/backing-dev.c                 |    1 -
 mm/page-writeback.c              |    1 -
 5 files changed, 10 insertions(+), 24 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-21 22:20:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
@@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
 				struct list_head *dispatch_queue,
 				struct writeback_control *wbc)
 {
+	unsigned long expire_interval = 0;
+	unsigned long older_than_this;
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	if (wbc->for_kupdate) {
+		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
+		older_than_this = jiffies - expire_interval;
+	}
+
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (wbc->older_than_this &&
-		    inode_dirtied_after(inode, *wbc->older_than_this))
+		if (expire_interval &&
+		    inode_dirtied_after(inode, older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -583,29 +590,19 @@ static inline bool over_bground_thresh(v
  * Try to run once per dirty_writeback_interval.  But if a writeback event
  * takes longer than a dirty_writeback_interval interval, then leave a
  * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
  */
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
-		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.range_cyclic		= work->range_cyclic,
 	};
-	unsigned long oldest_jif;
 	long wrote = 0;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -998,9 +995,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  * Write out a superblock's list of dirty inodes.  A wait will be performed
  * upon no inodes, all inodes or the final one, depending upon sync_mode.
  *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
  * If `bdi' is non-zero then we're being asked to writeback a specific queue.
  * This function assumes that the blockdev superblock's inodes are backed by
  * a variety of queues, so all inodes are searched.  For other superblocks,
--- linux-next.orig/include/linux/writeback.h	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
@@ -28,8 +28,6 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	unsigned long wb_start;         /* Time writeback_inodes_wb was
 					   called. This is needed to avoid
 					   extra jobs and livelock */
--- linux-next.orig/include/trace/events/writeback.h	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
@@ -100,7 +100,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
 		__field(int, more_io)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
 		__entry->more_io	= wbc->more_io;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d more=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -133,7 +130,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
 		__entry->more_io,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/mm/page-writeback.c	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-21 22:20:03.000000000 +0800
@@ -482,7 +482,6 @@ static void balance_dirty_pages(struct a
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
 			.range_cyclic	= 1,
 		};
--- linux-next.orig/mm/backing-dev.c	2010-07-22 11:23:34.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-07-22 11:23:39.000000000 +0800
@@ -271,7 +271,6 @@ static void bdi_flush_io(struct backing_
 {
 	struct writeback_control wbc = {
 		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 		.nr_to_write		= 1024,
 	};


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-07-22  5:09 ` Wu Fengguang
@ 2010-07-22  5:09   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-kill-more_io.patch --]
[-- Type: text/plain, Size: 2988 bytes --]

When wbc.more_io was first introduced, it indicated whether there was
at least one superblock whose s_more_io list contained more IO work. Now
with the per-bdi writeback, it can be replaced with a simple b_more_io
test.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    9 ++-------
 include/linux/writeback.h        |    1 -
 include/trace/events/writeback.h |    5 +----
 3 files changed, 3 insertions(+), 12 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
@@ -507,12 +507,8 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			return 1;
-		}
-		if (!list_empty(&wb->b_more_io))
-			wbc->more_io = 1;
 	}
 	/* b_io is empty */
 	return 1;
@@ -622,7 +618,6 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
-		wbc.more_io = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
 
@@ -644,7 +639,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
-		if (!wbc.more_io)
+		if (list_empty(&wb->b_more_io))
 			break;
 		/*
 		 * Did we write something? Try for more
--- linux-next.orig/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-22 11:24:46.000000000 +0800
@@ -49,7 +49,6 @@ struct writeback_control {
 	unsigned for_background:1;	/* A background writeback */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
-	unsigned more_io:1;		/* more io to be dispatched */
 };
 
 /*
--- linux-next.orig/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-07-22 11:24:46.000000000 +0800
@@ -99,7 +99,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(int, more_io)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -113,13 +112,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->more_io	= wbc->more_io;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -129,7 +127,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->more_io,
 		__entry->range_start,
 		__entry->range_end)
 )



^ permalink raw reply	[flat|nested] 98+ messages in thread


* [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-22  5:09 ` Wu Fengguang
@ 2010-07-22  5:09   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Jan Kara, Wu Fengguang, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

[-- Attachment #1: writeback-expired-for-background.patch --]
[-- Type: text/plain, Size: 2449 bytes --]

A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- retry with a halved expire interval until we get some inodes to sync

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
-			break;
+		    inode_dirtied_after(inode, older_than_this)) {
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval >>= 1;
+				older_than_this = jiffies - expire_interval;
+				continue;
+			} else
+				break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);



^ permalink raw reply	[flat|nested] 98+ messages in thread


* [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-22  5:09 ` Wu Fengguang
@ 2010-07-22  5:09   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-background-retry.patch --]
[-- Type: text/plain, Size: 2268 bytes --]

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate b_io when necessary at entrance time. When the queued
set of inodes is all synced, they just return, possibly with
wbc.nr_to_write > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes are completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.

This will livelock sync when there are heavy dirtiers. However in that case
sync would already be livelocked w/o this patch, as the current livelock
avoidance code is virtually a no-op (for one thing, wb_start should be
set statically at sync start time and be used in move_expired_inodes()).
The sync livelock problem will be addressed in other patches.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:54.000000000 +0800
@@ -640,20 +640,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * This is needed _before_ the b_more_io test because the
+		 * background writeback moves inodes to b_io and works on
+		 * them in batches (in order to sync old pages first).  The
+		 * completion of the current batch does not necessarily mean
+		 * the overall work is done.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 			continue;
+
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * Nothing written and no more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
-		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
-			continue;
+
 		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise



^ permalink raw reply	[flat|nested] 98+ messages in thread


* [PATCH 6/6] writeback: introduce writeback_control.inodes_written
  2010-07-22  5:09 ` Wu Fengguang
@ 2010-07-22  5:09   ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-inodes_written.patch --]
[-- Type: text/plain, Size: 1803 bytes --]

Introduce writeback_control.inodes_written to count successful
->write_inode() calls.  A non-zero value means there has been some
progress on writeback, in which case more writeback will be tried.

This prevents aborting a background writeback work prematurely when
the current set of inodes for IO happens to be metadata-only dirty.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    5 +++++
 include/linux/writeback.h |    1 +
 2 files changed, 6 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 13:07:54.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:58.000000000 +0800
@@ -379,6 +379,8 @@ writeback_single_inode(struct inode *ino
 		int err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
+		if (!err)
+			wbc->inodes_written++;
 	}
 
 	spin_lock(&inode_lock);
@@ -628,6 +630,7 @@ static long wb_writeback(struct bdi_writ
 
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
+		wbc.inodes_written = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -650,6 +653,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 			continue;
+		if (wbc.inodes_written)
+			continue;
 
 		/*
 		 * Nothing written and no more inodes for IO, bail
--- linux-next.orig/include/linux/writeback.h	2010-07-22 11:24:46.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-22 13:07:58.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_written;		/* Number of inodes(metadata) synced */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-22  5:09 ` Wu Fengguang
@ 2010-07-23 10:24   ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-23 10:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

I queued these up for testing yesterday before starting a review. For
anyone watching, the following patches are prerequisites from
linux-next if one wants to test against 2.6.35-rc5. I did this because I
wanted to test as few changes as possible.

a75db72d30a6402f4b1d841af3b4ce43682d0ac4 writeback: remove wb_list 
2225753c10aef6af9c764a295b71d11bc483c4d6 writeback: merge bdi_writeback_task and bdi_start_fn
aab24fcf6f5ccf0e8de3cc333559bddf9a46f11e writeback: Initial tracing support
f689fba23f3819e3e0bc237c104f2ec25decc219 writeback: Add tracing to balance_dirty_pages
ca43586868b49eb5a07d895708e4d257e2df814e simplify checks for I_CLEAR/I_FREEING

I applied your series on top of this and fired it up. The ordering of
patch application was still the same:

tracing
no direct writeback
Wu's patches and Christoph's pre-reqs from linux-next
Kick flusher threads when dirty pages applied

With them applied, btrfs failed to build but if it builds for you, it
just means I didn't bring a required patch from linux-next. I was
testing against XFS so I didn't dig too deep.

On Thu, Jul 22, 2010 at 01:09:28PM +0800, Wu Fengguang wrote:
> 
> The basic way of avoiding pageout() is to make the flusher sync inodes in the
> right order. Oldest dirty inodes contains oldest pages. The smaller inode it
> is, the more correlation between inode dirty time and its pages' dirty time.
> So for small dirty inodes, syncing in the order of inode dirty time is able to
> avoid pageout(). If pageout() is still triggered frequently in this case, the
> 30s dirty expire time may be too long and could be shrinked adaptively; or it
> may be a stressed memcg list whose dirty inodes/pages are more hard to track.
> 

Have you confirmed this theory with the trace points? It makes perfect
sense and is very rational but proof is a plus. I'm guessing you have
some decent writeback-related tests that might be of use. Mine have a
big mix of anon and file writeback so it's not as clear-cut.

Monitoring it isn't hard. Mount debugfs, enable the vmscan tracepoints
and read the tracing_pipe. To reduce interference, I always pipe it
through gzip and do post-processing afterwards offline with the script
included in Documentation/.
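As a rough sketch of that setup (the paths follow the usual ftrace layout under debugfs; treat them and the output filename as assumptions, and adjust for your kernel):

```shell
# Mount debugfs, enable all vmscan tracepoints, then capture the trace
# compressed for offline post-processing.
mount -t debugfs none /sys/kernel/debug
echo 1 > /sys/kernel/debug/tracing/events/vmscan/enable
cat /sys/kernel/debug/tracing/trace_pipe | gzip > vmscan-trace.gz
```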

Here is what I got from sysbench on x86-64 (other machines hours away)


SYSBENCH FTrace Reclaim Statistics
                    traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
Direct reclaims                                683        785        670        938 
Direct reclaim pages scanned                199776     161195     200400     166639 
Direct reclaim write file async I/O          64802          0          0          0 
Direct reclaim write anon async I/O           1009        419       1184      11390 
Direct reclaim write file sync I/O              18          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        685360     697255     691009     864602 
Kswapd wakeups                                1596       1517       1517       1545 
Kswapd pages scanned                      17527865   16817554   16816510   15032525 
Kswapd reclaim write file async I/O         888082     618123     649167     147903 
Kswapd reclaim write anon async I/O         229724     229123     233639     243561 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)             32.79      22.47      19.75       6.34 
Time kswapd awake (ms)                     2192.03    2165.17    2112.73    2055.90 

User/Sys Time Running Test (seconds)         663.3    656.37    664.14    654.63
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               6703.22   6468.78   6472.69   6479.62
Percentage Time kswapd Awake                 0.03%     0.00%     0.00%     0.00%

Flush oldest actually increased the number of pages written back by
kswapd, but the anon writeback is also high as swap is involved. Kicking
flusher threads also helps a lot. It helps less than the previous release
because I noticed I was kicking flusher threads for both anon and file
dirty pages, which is cheating. It's now only waking the threads for
file. It's still a reduction of 84% overall, so nothing to sneeze at.

What the patch did do was reduce time stalled in direct reclaim and time
kswapd spent awake so it still might be going the right direction. I
don't have a feeling for how much the writeback figures change between
runs because they take so long to run.

STRESS-HIGHALLOC FTrace Reclaim Statistics
                  stress-highalloc      stress-highalloc      stress-highalloc      stress-highalloc
                    traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
Direct reclaims                               1221       1284       1127       1252 
Direct reclaim pages scanned                146220     186156     142075     140617 
Direct reclaim write file async I/O           3433          0          0          0 
Direct reclaim write anon async I/O          25238      28758      23940      23247 
Direct reclaim write file sync I/O            3095          0          0          0 
Direct reclaim write anon sync I/O           10911     305579     281824     246251 
Wake kswapd requests                          1193       1196       1088       1209 
Kswapd wakeups                                 805        824        758        804 
Kswapd pages scanned                      30953364   52621368   42722498   30945547 
Kswapd reclaim write file async I/O         898087     241135     570467      54319 
Kswapd reclaim write anon async I/O        2278607    2201894    1885741    1949170 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)           8567.29    6628.83    6520.39    6947.23 
Time kswapd awake (ms)                     5847.60    3589.43    3900.74   15837.59 

User/Sys Time Running Test (seconds)       2824.76   2833.05   2833.26   2830.46
Percentage Time Spent Direct Reclaim         0.25%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)              10920.14   9021.17   8872.06   9301.86
Percentage Time kswapd Awake                 0.15%     0.00%     0.00%     0.00%

Same here, the number of pages written back by kswapd increased but
again anon writeback was a big factor. Kicking threads when dirty pages
are encountered still helps a lot with a 94% reduction of pages written
back overall.

Also, your patch really helped the time spent stalled by direct reclaim,
and kswapd was awake a lot less, with tests completing far faster.

Overall, I still think your series is a big help (although I don't know if
the patches in linux-next are also making a difference) but it's not actually
reducing the pages encountered by direct reclaim. Maybe that is because
the tests were making more forward progress and so scanning faster. The
sysbench performance results are too varied to draw conclusions from but it
did slightly improve the success rate of high-order allocations.

The flush-forward patches would appear to be a requirement. Christoph
first described them as a band-aid but he didn't chuck rocks at me when
the patch was actually released. Right now, I'm leaning towards pushing
it and judging by the Swear Meter how good/bad others think it is. So far
it's: me pro, Rik pro, Christoph maybe.

> For a large dirty inode, it may flush lots of newly dirtied pages _after_
> syncing the expired pages. This is the normal case for a single-stream
> sequential dirtier, where older pages are in lower offsets.  In this case we
> shall not insist on syncing the whole large dirty inode before considering the
> other small dirty inodes. This risks wasting time syncing 1GB freshly dirtied
> pages before syncing the other N*1MB expired dirty pages who are approaching
> the end of the LRU list and hence pageout().
> 

Intuitively, this makes a lot of sense.

> For a large dirty inode, it may also flush lots of newly dirtied pages _before_
> hitting the desired old ones, in which case it helps for pageout() to do some
> clustered writeback, and/or set mapping->writeback_index to help the flusher
> focus on old pages.
> 

Will put this idea on the maybe pile.

> For a large dirty inode, it may also have intermixed old and new dirty pages.
> In this case we need to make sure the inode is queued for IO before some of
> its pages hit pageout(). Adaptive dirty expire time helps here.
> 
> OK, end of the vapour ideas. As for this patchset, it fixes the current
> kupdate/background writeback priority:
> 
> - the kupdate/background writeback shall include newly expired inodes at each
>   queue_io() time, as the large inodes left over from previous writeback rounds
>   are likely to have less density of old pages.
> 
> - the background writeback shall consider expired inodes first, just like the
>   kupdate writeback
> 

I haven't actually reviewed these. I got testing kicked off first
because it didn't require brains :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-23 17:39     ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2010-07-23 17:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu 22-07-10 13:09:33, Wu Fengguang wrote:
> writeback_inodes_wb()/__writeback_inodes_sb() are not agressive in that
> they only populate b_io when necessary at entrance time. When the queued
> set of inodes are all synced, they just return, possibly with
> wbc.nr_to_write > 0.
> 
> For kupdate and background writeback, there may be more eligible inodes
> sitting in b_dirty when the current set of b_io inodes are completed. So
> it is necessary to try another round of writeback as long as we made some
> progress in this round. When there are no more eligible inodes, no more
> inodes will be enqueued in queue_io(), hence nothing could/will be
> synced and we may safely bail.
> 
> This will livelock sync when there are heavy dirtiers. However in that case
> sync will already be livelocked w/o this patch, as the current livelock
> avoidance code is virtually a no-op (for one thing, wb_time should be
> set statically at sync start time and be used in move_expired_inodes()).
> The sync livelock problem will be addressed in other patches.
  Hmm, any reason why you don't solve this problem by just removing the
condition before queue_io()? It would also make the logic simpler - always
queue all inodes that are eligible for writeback...

								Honza


> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:54.000000000 +0800
> @@ -640,20 +640,23 @@ static long wb_writeback(struct bdi_writ
>  		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  
>  		/*
> -		 * If we consumed everything, see if we have more
> +		 * Did we write something? Try for more
> +		 *
> +		 * This is needed _before_ the b_more_io test because the
> +		 * background writeback moves inodes to b_io and works on
> +		 * them in batches (in order to sync old pages first).  The
> +		 * completion of the current batch does not necessarily mean
> +		 * the overall work is done.
>  		 */
> -		if (wbc.nr_to_write <= 0)
> +		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
>  			continue;
> +
>  		/*
> -		 * Didn't write everything and we don't have more IO, bail
> +		 * Nothing written and no more inodes for IO, bail
>  		 */
>  		if (list_empty(&wb->b_more_io))
>  			break;
> -		/*
> -		 * Did we write something? Try for more
> -		 */
> -		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> -			continue;
> +
>  		/*
>  		 * Nothing written. Wait for some inode to
>  		 * become available for writeback. Otherwise
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-23 18:15     ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2010-07-23 18:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Thu 22-07-10 13:09:32, Wu Fengguang wrote:
> A background flush work may run forever, so it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is:
> - enqueue all newly expired inodes at each queue_io() time
> - retry with a halved expire interval until we get some inodes to sync
  Hmm, this logic looks a bit arbitrary to me. What I actually don't like
very much about this is that when there aren't inodes older than, say, 2
seconds, you'll end up queueing just the inodes between 2s and 1s. So I'd
rather just queue inodes older than the limit and, if there are none,
queue all the other dirty inodes.

								Honza

> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
>  				struct writeback_control *wbc)
>  {
>  	unsigned long expire_interval = 0;
> -	unsigned long older_than_this;
> +	unsigned long older_than_this = 0; /* reset to kill gcc warning */
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
>  	struct inode *inode;
>  	int do_sb_sort = 0;
>  
> -	if (wbc->for_kupdate) {
> +	if (wbc->for_kupdate || wbc->for_background) {
>  		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
>  		older_than_this = jiffies - expire_interval;
>  	}
> @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
>  		if (expire_interval &&
> -		    inode_dirtied_after(inode, older_than_this))
> -			break;
> +		    inode_dirtied_after(inode, older_than_this)) {
> +			if (wbc->for_background &&
> +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> +				expire_interval >>= 1;
> +				older_than_this = jiffies - expire_interval;
> +				continue;
> +			} else
> +				break;
> +		}
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
>  		sb = inode->i_sb;
> @@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
>  
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +
> +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
> @@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
>  
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  	writeback_sb_inodes(sb, wb, wbc, true);
>  	spin_unlock(&inode_lock);
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-23 18:16     ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2010-07-23 18:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu 22-07-10 13:09:29, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
  Looks OK.

Acked-by: Jan Kara <jack@suse.cz>

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-21 20:12:38.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-21 20:14:38.000000000 +0800
> @@ -213,8 +213,8 @@ static bool inode_dirtied_after(struct i
>   * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
>   */
>  static void move_expired_inodes(struct list_head *delaying_queue,
> -			       struct list_head *dispatch_queue,
> -				unsigned long *older_than_this)
> +				struct list_head *dispatch_queue,
> +				struct writeback_control *wbc)
>  {
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
> @@ -224,8 +224,8 @@ static void move_expired_inodes(struct l
>  
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> -		if (older_than_this &&
> -		    inode_dirtied_after(inode, *older_than_this))
> +		if (wbc->older_than_this &&
> +		    inode_dirtied_after(inode, *wbc->older_than_this))
>  			break;
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
> @@ -257,10 +257,10 @@ static void move_expired_inodes(struct l
>   *                 => b_more_io inodes
>   *                 => remaining inodes in b_io => (dequeue for sync)
>   */
> -static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> +static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
>  {
>  	list_splice_init(&wb->b_more_io, &wb->b_io);
> -	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> +	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
>  }
>  
>  static int write_inode(struct inode *inode, struct writeback_control *wbc)
> @@ -519,7 +519,7 @@ void writeback_inodes_wb(struct bdi_writ
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
>  	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
> +		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = list_entry(wb->b_io.prev,
> @@ -548,7 +548,7 @@ static void __writeback_inodes_sb(struct
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
>  	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
> +		queue_io(wb, wbc);
>  	writeback_sb_inodes(sb, wb, wbc, true);
>  	spin_unlock(&inode_lock);
>  }
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-23 18:17     ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2010-07-23 18:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Thu 22-07-10 13:09:30, Wu Fengguang wrote:
> Dynamically compute the dirty expire timestamp at queue_io() time.
> Also remove writeback_control.older_than_this which is no longer used.
> 
> writeback_control.older_than_this used to be determined at entrance to
> the kupdate writeback work. This _static_ timestamp may go stale if the
> kupdate work runs on and on. The flusher may then get stuck with some old
> busy inodes, never considering newly expired inodes thereafter.
  This seems to make sense. The patch looks fine as well.

Acked-by: Jan Kara <jack@suse.cz>

								Honza
> 
> This has two possible problems:
> 
> - It is unfair for a large dirty inode to delay (for a long time) the
>   writeback of small dirty inodes.
> 
> - As time goes by, the large and busy dirty inode may contain only
>   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
>   delaying the expired dirty pages to the end of LRU lists, triggering
>   the very bad pageout(). Nevertheless this patch merely addresses part
>   of the problem.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |   24 +++++++++---------------
>  include/linux/writeback.h        |    2 --
>  include/trace/events/writeback.h |    6 +-----
>  mm/backing-dev.c                 |    1 -
>  mm/page-writeback.c              |    1 -
>  5 files changed, 10 insertions(+), 24 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-21 22:20:01.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
> @@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
>  				struct list_head *dispatch_queue,
>  				struct writeback_control *wbc)
>  {
> +	unsigned long expire_interval = 0;
> +	unsigned long older_than_this;
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
>  	struct inode *inode;
>  	int do_sb_sort = 0;
>  
> +	if (wbc->for_kupdate) {
> +		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
> +		older_than_this = jiffies - expire_interval;
> +	}
> +
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> -		if (wbc->older_than_this &&
> -		    inode_dirtied_after(inode, *wbc->older_than_this))
> +		if (expire_interval &&
> +		    inode_dirtied_after(inode, older_than_this))
>  			break;
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
> @@ -583,29 +590,19 @@ static inline bool over_bground_thresh(v
>   * Try to run once per dirty_writeback_interval.  But if a writeback event
>   * takes longer than a dirty_writeback_interval interval, then leave a
>   * one-second gap.
> - *
> - * older_than_this takes precedence over nr_to_write.  So we'll only write back
> - * all dirty pages if they are all attached to "old" mappings.
>   */
>  static long wb_writeback(struct bdi_writeback *wb,
>  			 struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> -		.older_than_this	= NULL,
>  		.for_kupdate		= work->for_kupdate,
>  		.for_background		= work->for_background,
>  		.range_cyclic		= work->range_cyclic,
>  	};
> -	unsigned long oldest_jif;
>  	long wrote = 0;
>  	struct inode *inode;
>  
> -	if (wbc.for_kupdate) {
> -		wbc.older_than_this = &oldest_jif;
> -		oldest_jif = jiffies -
> -				msecs_to_jiffies(dirty_expire_interval * 10);
> -	}
>  	if (!wbc.range_cyclic) {
>  		wbc.range_start = 0;
>  		wbc.range_end = LLONG_MAX;
> @@ -998,9 +995,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   * Write out a superblock's list of dirty inodes.  A wait will be performed
>   * upon no inodes, all inodes or the final one, depending upon sync_mode.
>   *
> - * If older_than_this is non-NULL, then only write out inodes which
> - * had their first dirtying at a time earlier than *older_than_this.
> - *
>   * If `bdi' is non-zero then we're being asked to writeback a specific queue.
>   * This function assumes that the blockdev superblock's inodes are backed by
>   * a variety of queues, so all inodes are searched.  For other superblocks,
> --- linux-next.orig/include/linux/writeback.h	2010-07-21 22:20:02.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
> @@ -28,8 +28,6 @@ enum writeback_sync_modes {
>   */
>  struct writeback_control {
>  	enum writeback_sync_modes sync_mode;
> -	unsigned long *older_than_this;	/* If !NULL, only write back inodes
> -					   older than this */
>  	unsigned long wb_start;         /* Time writeback_inodes_wb was
>  					   called. This is needed to avoid
>  					   extra jobs and livelock */
> --- linux-next.orig/include/trace/events/writeback.h	2010-07-21 22:20:02.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
> @@ -100,7 +100,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__field(int, for_reclaim)
>  		__field(int, range_cyclic)
>  		__field(int, more_io)
> -		__field(unsigned long, older_than_this)
>  		__field(long, range_start)
>  		__field(long, range_end)
>  	),
> @@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_reclaim	= wbc->for_reclaim;
>  		__entry->range_cyclic	= wbc->range_cyclic;
>  		__entry->more_io	= wbc->more_io;
> -		__entry->older_than_this = wbc->older_than_this ?
> -						*wbc->older_than_this : 0;
>  		__entry->range_start	= (long)wbc->range_start;
>  		__entry->range_end	= (long)wbc->range_end;
>  	),
>  
>  	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> -		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
> +		"bgrd=%d reclm=%d cyclic=%d more=%d "
>  		"start=0x%lx end=0x%lx",
>  		__entry->name,
>  		__entry->nr_to_write,
> @@ -133,7 +130,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_reclaim,
>  		__entry->range_cyclic,
>  		__entry->more_io,
> -		__entry->older_than_this,
>  		__entry->range_start,
>  		__entry->range_end)
>  )
> --- linux-next.orig/mm/page-writeback.c	2010-07-21 22:20:02.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-07-21 22:20:03.000000000 +0800
> @@ -482,7 +482,6 @@ static void balance_dirty_pages(struct a
>  	for (;;) {
>  		struct writeback_control wbc = {
>  			.sync_mode	= WB_SYNC_NONE,
> -			.older_than_this = NULL,
>  			.nr_to_write	= write_chunk,
>  			.range_cyclic	= 1,
>  		};
> --- linux-next.orig/mm/backing-dev.c	2010-07-22 11:23:34.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2010-07-22 11:23:39.000000000 +0800
> @@ -271,7 +271,6 @@ static void bdi_flush_io(struct backing_
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= WB_SYNC_NONE,
> -		.older_than_this	= NULL,
>  		.range_cyclic		= 1,
>  		.nr_to_write		= 1024,
>  	};
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-23 18:24     ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2010-07-23 18:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu 22-07-10 13:09:31, Wu Fengguang wrote:
> When wbc.more_io was first introduced, it indicated whether there was
> at least one superblock whose s_more_io contained more IO work. Now with
> the per-bdi writeback, it can be replaced with a simple b_more_io test.
  Looks fine.

Acked-by: Jan Kara <jack@suse.cz>

> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |    9 ++-------
>  include/linux/writeback.h        |    1 -
>  include/trace/events/writeback.h |    5 +----
>  3 files changed, 3 insertions(+), 12 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> @@ -507,12 +507,8 @@ static int writeback_sb_inodes(struct su
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0) {
> -			wbc->more_io = 1;
> +		if (wbc->nr_to_write <= 0)
>  			return 1;
> -		}
> -		if (!list_empty(&wb->b_more_io))
> -			wbc->more_io = 1;
>  	}
>  	/* b_io is empty */
>  	return 1;
> @@ -622,7 +618,6 @@ static long wb_writeback(struct bdi_writ
>  		if (work->for_background && !over_bground_thresh())
>  			break;
>  
> -		wbc.more_io = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
>  
> @@ -644,7 +639,7 @@ static long wb_writeback(struct bdi_writ
>  		/*
>  		 * Didn't write everything and we don't have more IO, bail
>  		 */
> -		if (!wbc.more_io)
> +		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
>  		 * Did we write something? Try for more
> --- linux-next.orig/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-22 11:24:46.000000000 +0800
> @@ -49,7 +49,6 @@ struct writeback_control {
>  	unsigned for_background:1;	/* A background writeback */
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> -	unsigned more_io:1;		/* more io to be dispatched */
>  };
>  
>  /*
> --- linux-next.orig/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2010-07-22 11:24:46.000000000 +0800
> @@ -99,7 +99,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__field(int, for_background)
>  		__field(int, for_reclaim)
>  		__field(int, range_cyclic)
> -		__field(int, more_io)
>  		__field(long, range_start)
>  		__field(long, range_end)
>  	),
> @@ -113,13 +112,12 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background	= wbc->for_background;
>  		__entry->for_reclaim	= wbc->for_reclaim;
>  		__entry->range_cyclic	= wbc->range_cyclic;
> -		__entry->more_io	= wbc->more_io;
>  		__entry->range_start	= (long)wbc->range_start;
>  		__entry->range_end	= (long)wbc->range_end;
>  	),
>  
>  	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> -		"bgrd=%d reclm=%d cyclic=%d more=%d "
> +		"bgrd=%d reclm=%d cyclic=%d "
>  		"start=0x%lx end=0x%lx",
>  		__entry->name,
>  		__entry->nr_to_write,
> @@ -129,7 +127,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background,
>  		__entry->for_reclaim,
>  		__entry->range_cyclic,
> -		__entry->more_io,
>  		__entry->range_start,
>  		__entry->range_end)
>  )
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-23 10:24   ` Mel Gorman
@ 2010-07-26  7:18     ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26  7:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

> On Thu, Jul 22, 2010 at 01:09:28PM +0800, Wu Fengguang wrote:
> > 
> > The basic way of avoiding pageout() is to make the flusher sync inodes in the
> > right order. The oldest dirty inodes contain the oldest pages. The smaller the
> > inode, the stronger the correlation between the inode's dirty time and its
> > pages' dirty time. So for small dirty inodes, syncing in order of inode dirty
> > time is able to avoid pageout(). If pageout() is still triggered frequently in
> > this case, the 30s dirty expire time may be too long and could be shrunk
> > adaptively; or it may be a stressed memcg list whose dirty inodes/pages are
> > harder to track.
> > 
> 
> Have you confirmed this theory with the trace points? It makes perfect
> sense and is very rational but proof is a plus.

The proof would be simple.

On average, it takes longer to dirty a large file than a small one.

For example, when uploading files to a file server with 1MB/s
throughput, it will take 10s for a 10MB file and 30s for a 30MB file.
This is the common case.

Another case is a fast dirtier. It may take 10ms to dirty a 100MB
file but 10s to dirty a 1GB file -- the latter is dirty throttled down
to a much lower IO throughput because it produces too many dirty pages.
The opposite may happen, but this case is the more likely one. If both
are throttled, it degenerates to the file server case above.

So large files tend to contain dirty pages of more varied age.
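The arithmetic above can be sketched numerically (a toy illustration using the numbers from the examples, not measurements):

```python
# Rough illustration of why larger files span a wider dirty-age range:
# the time between a file's oldest and newest dirty page, assuming it
# is written sequentially at a steady effective throughput.
def dirty_span_seconds(file_mb, throughput_mb_s):
    return file_mb / throughput_mb_s

# File-server case: 1 MB/s upload throughput.
small = dirty_span_seconds(10, 1.0)    # 10 MB file -> 10 s span
large = dirty_span_seconds(30, 1.0)    # 30 MB file -> 30 s span

# Fast dirtier: a small file fits in the page cache and is dirtied at
# memory speed; a large one gets dirty throttled to disk-like speed.
burst = dirty_span_seconds(100, 10000)     # 100 MB at ~10 GB/s -> ~10 ms
throttled = dirty_span_seconds(1024, 100)  # 1 GB at ~100 MB/s -> ~10 s

print(small, large, burst, throttled)
```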

> I'm guessing you have
> some decent writeback-related tests that might be of use. Mine have a
> big mix of anon and file writeback so it's not as clear-cut.

A neat trick is to run your test with `swapoff -a` :)

Seriously, I have no scripts to monitor pageout() calls.
I'll explore ways to test it.

> Monitoring it isn't hard. Mount debugfs, enable the vmscan tracepoints
> and read the tracing_pipe. To reduce interference, I always pipe it
> through gzip and do post-processing afterwards offline with the script
> included in Documentation/

Thanks for the tip!
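For what it's worth, a minimal offline post-processing pass of the kind Mel describes might look like the sketch below; the `mm_vmscan_writepage` event name and the `flags=` line format are assumptions about the tracepoint output, not taken from this thread:

```python
import re

# Count file vs anon writeback events in a captured ftrace log.
# Assumed line shape:
#   "... mm_vmscan_writepage: page=... flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC"
def count_writepage_events(lines):
    counts = {"file": 0, "anon": 0}
    for line in lines:
        m = re.search(r"mm_vmscan_writepage:.*flags=(\S+)", line)
        if not m:
            continue
        if "RECLAIM_WB_FILE" in m.group(1):
            counts["file"] += 1
        elif "RECLAIM_WB_ANON" in m.group(1):
            counts["anon"] += 1
    return counts

sample = [
    "kswapd0-75 [001] 100.0: mm_vmscan_writepage: page=0xdead flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC",
    "kswapd0-75 [001] 100.1: mm_vmscan_writepage: page=0xbeef flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC",
    "kswapd0-75 [001] 100.2: some_other_event: foo",
]
print(count_writepage_events(sample))  # {'file': 1, 'anon': 1}
```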

> Here is what I got from sysbench on x86-64 (other machines hours away)
> 
> 
> SYSBENCH FTrace Reclaim Statistics
>                     traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
> Direct reclaims                                683        785        670        938 
> Direct reclaim pages scanned                199776     161195     200400     166639 
> Direct reclaim write file async I/O          64802          0          0          0 
> Direct reclaim write anon async I/O           1009        419       1184      11390 
> Direct reclaim write file sync I/O              18          0          0          0 
> Direct reclaim write anon sync I/O               0          0          0          0 
> Wake kswapd requests                        685360     697255     691009     864602 
> Kswapd wakeups                                1596       1517       1517       1545 
> Kswapd pages scanned                      17527865   16817554   16816510   15032525 
> Kswapd reclaim write file async I/O         888082     618123     649167     147903 
> Kswapd reclaim write anon async I/O         229724     229123     233639     243561 
> Kswapd reclaim write file sync I/O               0          0          0          0 
> Kswapd reclaim write anon sync I/O               0          0          0          0 

> Time stalled direct reclaim (ms)             32.79      22.47      19.75       6.34 
> Time kswapd awake (ms)                     2192.03    2165.17    2112.73    2055.90 

I noticed that $total_direct_latency is divided by 1000 before
printing the above lines, so the unit should be seconds?

> User/Sys Time Running Test (seconds)         663.3    656.37    664.14    654.63
> Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%     0.00%
> Total Elapsed Time (seconds)               6703.22   6468.78   6472.69   6479.62
> Percentage Time kswapd Awake                 0.03%     0.00%     0.00%     0.00%

I don't see the code for generating the "Percentage" lines. And the
numbers seem too small to be true.
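A quick sanity check of the two unit interpretations, using the kswapd numbers from the table above (illustrative arithmetic only):

```python
# "Time kswapd awake" is printed as 2192.03 over a 6703.22 s elapsed run.
awake = 2192.03
elapsed_s = 6703.22

pct_if_ms = awake / 1000.0 / elapsed_s * 100  # if the value is milliseconds
pct_if_s = awake / elapsed_s * 100            # if the value is seconds

print(round(pct_if_ms, 2))  # 0.03 -- matches the printed percentage
print(round(pct_if_s, 1))   # 32.7 -- far larger than the printed 0.03%
```

Only one interpretation is consistent with the printed 0.03%, which is why the unit label looks suspect.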

> Flush oldest actually increased the number of pages written back by
> kswapd but the anon writeback is also high as swap is involved. Kicking
> flusher threads also helps a lot. It helps less than the previous release
> because I noticed I was kicking flusher threads for both anon and file
> dirty pages which is cheating. It's now only waking the threads for
> file. It's still a reduction of 84% overall so nothing to sneeze at.
> 
> What the patch did do was reduce time stalled in direct reclaim and time
> kswapd spent awake so it still might be going the right direction. I
> don't have a feeling for how much the writeback figures change between
> runs because they take so long to run.
> 
> STRESS-HIGHALLOC FTrace Reclaim Statistics
>                   stress-highalloc      stress-highalloc      stress-highalloc      stress-highalloc
>                     traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
> Direct reclaims                               1221       1284       1127       1252 
> Direct reclaim pages scanned                146220     186156     142075     140617 
> Direct reclaim write file async I/O           3433          0          0          0 
> Direct reclaim write anon async I/O          25238      28758      23940      23247 
> Direct reclaim write file sync I/O            3095          0          0          0 
> Direct reclaim write anon sync I/O           10911     305579     281824     246251 
> Wake kswapd requests                          1193       1196       1088       1209 
> Kswapd wakeups                                 805        824        758        804 
> Kswapd pages scanned                      30953364   52621368   42722498   30945547 
> Kswapd reclaim write file async I/O         898087     241135     570467      54319 
> Kswapd reclaim write anon async I/O        2278607    2201894    1885741    1949170 
> Kswapd reclaim write file sync I/O               0          0          0          0 
> Kswapd reclaim write anon sync I/O               0          0          0          0 
> Time stalled direct reclaim (ms)           8567.29    6628.83    6520.39    6947.23 
> Time kswapd awake (ms)                     5847.60    3589.43    3900.74   15837.59 
> 
> User/Sys Time Running Test (seconds)       2824.76   2833.05   2833.26   2830.46
> Percentage Time Spent Direct Reclaim         0.25%     0.00%     0.00%     0.00%
> Total Elapsed Time (seconds)              10920.14   9021.17   8872.06   9301.86
> Percentage Time kswapd Awake                 0.15%     0.00%     0.00%     0.00%
> 
> Same here, the number of pages written back by kswapd increased but
> again anon writeback was a big factor. Kicking threads when dirty pages
> are encountered still helps a lot with a 94% reduction of pages written
> back overall..

That is impressive! So it definitely helps to reduce the total number
of dirty pages under memory pressure.

> Also, your patch really helped the time spent stalled by direct reclaim
> and kswapd was awake a lot less, with tests completing far faster.

Thanks. So it does improve the dirty page layout in the LRU lists.

> Overall, I still think your series is a big help (although I don't know if
> the patches in linux-next are also making a difference) but it's not actually
> reducing the pages encountered by direct reclaim. Maybe that is because
> the tests were making more forward progress and so scanning faster. The
> sysbench performance results are too varied to draw conclusions from but it
> did slightly improve the success rate of high-order allocations.
> 
> The flush-forward patches would appear to be a requirement. Christoph
> first described them as a band-aid but he didn't chuck rocks at me when
> the patch was actually released. Right now, I'm leaning towards pushing
> it and judge by the Swear Meter how good/bad others think it is. So far
> it's: me pro, Rik pro, Christoph maybe.

Sorry for the delay, I'll help review it.

> > For a large dirty inode, it may flush lots of newly dirtied pages _after_
> > syncing the expired pages. This is the normal case for a single-stream
> > sequential dirtier, where older pages are in lower offsets.  In this case we
> > shall not insist on syncing the whole large dirty inode before considering the
> > other small dirty inodes. This risks wasting time syncing 1GB freshly dirtied
> > pages before syncing the other N*1MB expired dirty pages which are approaching
> > the end of the LRU list and hence pageout().
> > 
> 
> Intuitively, this makes a lot of sense.
> 
> > For a large dirty inode, it may also flush lots of newly dirtied pages _before_
> > hitting the desired old ones, in which case it helps for pageout() to do some
> > clustered writeback, and/or set mapping->writeback_index to help the flusher
> > focus on old pages.
> > 
> 
> Will put this idea on the maybe pile.
> 
> > For a large dirty inode, it may also have intermixed old and new dirty pages.
> > In this case we need to make sure the inode is queued for IO before some of
> > its pages hit pageout(). Adaptive dirty expire time helps here.
> > 
> > OK, end of the vapour ideas. As for this patchset, it fixes the current
> > kupdate/background writeback priority:
> > 
> > - the kupdate/background writeback shall include newly expired inodes at each
> >   queue_io() time, as the large inodes left over from previous writeback rounds
> >   are likely to have a lower density of old pages.
> > 
> > - the background writeback shall consider expired inodes first, just like the
> >   kupdate writeback
> > 
> 
> I haven't actually reviewed these. I got testing kicked off first
> because it didn't require brains :)

Thanks all the same!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-22  5:09 ` Wu Fengguang
@ 2010-07-26 10:28   ` Itaru Kitayama
  -1 siblings, 0 replies; 98+ messages in thread
From: Itaru Kitayama @ 2010-07-26 10:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

Hi,
Here's a touch up patch on top of your changes against the latest
mmotm.

Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
---
 fs/btrfs/extent_io.c        |    2 --
 include/trace/events/ext4.h |    5 +----
 2 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cb9af26..b494dee 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2586,7 +2586,6 @@ int extent_write_full_page(struct extent_io_tree *tree, struct page *page,
        };
        struct writeback_control wbc_writepages = {
                .sync_mode      = wbc->sync_mode,
-               .older_than_this = NULL,
                .nr_to_write    = 64,
                .range_start    = page_offset(page) + PAGE_CACHE_SIZE,
                .range_end      = (loff_t)-1,
@@ -2619,7 +2618,6 @@ int extent_write_locked_range(struct extent_io_tree *tree, struct inode *inode,
        };
        struct writeback_control wbc_writepages = {
                .sync_mode      = mode,
-               .older_than_this = NULL,
                .nr_to_write    = nr_pages * 2,
                .range_start    = start,
                .range_end      = end + 1,
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f3865c7..099598b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -305,7 +305,6 @@ TRACE_EVENT(ext4_da_writepages_result,
                __field(        int,    ret                     )
                __field(        int,    pages_written           )
                __field(        long,   pages_skipped           )
-               __field(        char,   more_io                 )       
                __field(       pgoff_t, writeback_index         )
        ),
 
@@ -315,15 +314,13 @@ TRACE_EVENT(ext4_da_writepages_result,
                __entry->ret            = ret;
                __entry->pages_written  = pages_written;
                __entry->pages_skipped  = wbc->pages_skipped;
-               __entry->more_io        = wbc->more_io;
                __entry->writeback_index = inode->i_mapping->writeback_index;
        ),
 
-       TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld more_io %d writeback_index %lu",
+       TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld writeback_index %lu",
                  jbd2_dev_to_name(__entry->dev),
                  (unsigned long) __entry->ino, __entry->ret,
                  __entry->pages_written, __entry->pages_skipped,
-                 __entry->more_io,
                  (unsigned long) __entry->writeback_index)
 );
 
-- 
1.7.1.1

^ permalink raw reply related	[flat|nested] 98+ messages in thread

+       TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld writeback_index %lu",
                  jbd2_dev_to_name(__entry->dev),
                  (unsigned long) __entry->ino, __entry->ret,
                  __entry->pages_written, __entry->pages_skipped,
-                 __entry->more_io,
                  (unsigned long) __entry->writeback_index)
 );
 
-- 
1.7.1.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-26  7:18     ` Wu Fengguang
@ 2010-07-26 10:42       ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-26 10:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 03:18:03PM +0800, Wu Fengguang wrote:
> > On Thu, Jul 22, 2010 at 01:09:28PM +0800, Wu Fengguang wrote:
> > > 
> > > The basic way of avoiding pageout() is to make the flusher sync inodes in the
> > > right order. The oldest dirty inodes contain the oldest pages. The smaller the
> > > inode, the stronger the correlation between the inode's dirty time and its
> > > pages' dirty time. So for small dirty inodes, syncing in order of inode dirty
> > > time is able to avoid pageout(). If pageout() is still triggered frequently in
> > > this case, the 30s dirty expire time may be too long and could be shrunk
> > > adaptively; or it may be a stressed memcg list whose dirty inodes/pages are
> > > harder to track.
> > > 
> > 
> > Have you confirmed this theory with the trace points? It makes perfect
> > sense and is very rational but proof is a plus.
> 
> The proof would be simple.
> 
> On average, it takes longer time to dirty a large file than a small file.
> 
> For example, when uploading files to a file server with 1MB/s
> throughput, it will take 10s for a 10MB file and 30s for a 30MB file.
> This is the common case.
> 
> Another case is a fast dirtier. It may take 10ms to dirty a 100MB
> file and 10s to dirty a 1GB file -- the latter is dirty-throttled to
> the much lower IO throughput due to too many dirty pages. The opposite
> may happen, but this is the more likely case. If both are
> throttled, it degenerates to the above file server case.
> 
> So large files tend to contain dirty pages of more varied age.
> 

Ok.

> > I'm guessing you have
> > some decent writeback-related tests that might be of use. Mine have a
> > big mix of anon and file writeback so it's not as clear-cut.
> 
> A neat trick is to run your test with `swapoff -a` :)
> 

Good point.

> Seriously I have no scripts to monitor pageout() calls.
> I'll explore ways to test it.
> 

I'll see about running tests with swapoff.

> > Monitoring it isn't hard. Mount debugfs, enable the vmscan tracepoints
> > and read the tracing_pipe. To reduce interference, I always pipe it
> > through gzip and do post-processing afterwards offline with the script
> > included in Documentation/
> 
> Thanks for the tip!
> 
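For what it's worth, a minimal stand-in for such post-processing (not the Documentation/ script referenced above) can simply tally mm_vmscan_writepage events from a gzip'd trace_pipe capture. The event name and RECLAIM_WB_* flag strings below are assumed from the vmscan tracepoints of this era, so treat this as a sketch to be checked against the running kernel, not the real script:

```python
import gzip
import re
from collections import Counter

# Match mm_vmscan_writepage events and capture their reclaim flags,
# e.g. flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC (assumed format).
FLAGS_RE = re.compile(r"mm_vmscan_writepage:.*flags=(\S+)")

def count_writepage_events(lines):
    """Tally reclaim writeback events per flag combination."""
    counts = Counter()
    for line in lines:
        m = FLAGS_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def count_from_gzip(path):
    # For a capture made with: cat trace_pipe | gzip > vmscan.gz
    with gzip.open(path, "rt", errors="replace") as f:
        return count_writepage_events(f)
```

This gives the same file/anon and sync/async split as the tables below, assuming the flag strings are right.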
> > Here is what I got from sysbench on x86-64 (other machines hours away)
> > 
> > 
> > SYSBENCH FTrace Reclaim Statistics
> >                     traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
> > Direct reclaims                                683        785        670        938 
> > Direct reclaim pages scanned                199776     161195     200400     166639 
> > Direct reclaim write file async I/O          64802          0          0          0 
> > Direct reclaim write anon async I/O           1009        419       1184      11390 
> > Direct reclaim write file sync I/O              18          0          0          0 
> > Direct reclaim write anon sync I/O               0          0          0          0 
> > Wake kswapd requests                        685360     697255     691009     864602 
> > Kswapd wakeups                                1596       1517       1517       1545 
> > Kswapd pages scanned                      17527865   16817554   16816510   15032525 
> > Kswapd reclaim write file async I/O         888082     618123     649167     147903 
> > Kswapd reclaim write anon async I/O         229724     229123     233639     243561 
> > Kswapd reclaim write file sync I/O               0          0          0          0 
> > Kswapd reclaim write anon sync I/O               0          0          0          0 
> 
> > Time stalled direct reclaim (ms)             32.79      22.47      19.75       6.34 
> > Time kswapd awake (ms)                     2192.03    2165.17    2112.73    2055.90 
> 
> I noticed that $total_direct_latency is divided by 1000 before
> printing the above lines, so the unit should be seconds?
> 

Correct. That figure was generated by another post-processing script that
creates the table. It got the units wrong, so the percentage time lines are
wrong and the time-spent-awake lines say ms when they should say seconds. Sorry
about that.

> > User/Sys Time Running Test (seconds)         663.3    656.37    664.14    654.63
> > Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%     0.00%
> > Total Elapsed Time (seconds)               6703.22   6468.78   6472.69   6479.62
> > Percentage Time kswapd Awake                 0.03%     0.00%     0.00%     0.00%
> 
> I don't see the code for generating the "Percentage" lines. And the
> numbers seem too small to be true.
> 

The code is in a table-generation script that had access to data on the
length of time the test ran. I had been ignoring the percentage time line
since early on, so I missed the error.

The percentage time spent on direct reclaim is

direct_reclaim*100/(user_time+sys_time+stalled_time)

A report based on a corrected script looks like:

                    traceonly-v5r6         nodirect-v5r9 flusholdest-v5r9     flushforward-v5r9
Direct reclaims                                683        528        808 943 
Direct reclaim pages scanned                199776     298562     125991 83325 
Direct reclaim write file async I/O          64802          0          0 0 
Direct reclaim write anon async I/O           1009       3340        926 2227 
Direct reclaim write file sync I/O              18          0          0 0 
Direct reclaim write anon sync I/O               0          0          0 0 
Wake kswapd requests                        685360     522123     763448 827895 
Kswapd wakeups                                1596       1538       1452 1565 
Kswapd pages scanned                      17527865   17020235   16367809 15415022 
Kswapd reclaim write file async I/O         888082     869540     536427 89004 
Kswapd reclaim write anon async I/O         229724     262934     253396 215861 
Kswapd reclaim write file sync I/O               0          0          0 0 
Kswapd reclaim write anon sync I/O               0          0          0 0 
Time stalled direct reclaim (seconds)        32.79      23.46      20.70 7.01 
Time kswapd awake (seconds)                2192.03    2172.22    2117.82 2166.53 

User/Sys Time Running Test (seconds)         663.3    644.43    637.34 680.53
Percentage Time Spent Direct Reclaim         4.71%     3.51%     3.15% 1.02%
Total Elapsed Time (seconds)               6703.22   6477.95   6503.39 6781.90
Percentage Time kswapd Awake                32.70%    33.53%    32.56% 31.95%
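For reference, the percentage rows above follow directly from that formula; a quick sketch with the values copied from the corrected table:

```python
# Reproduce the percentage rows from the raw rows of the corrected report.
def pct_direct_reclaim(user_sys_time, stalled_time):
    # direct_reclaim*100/(user_time+sys_time+stalled_time)
    return round(stalled_time * 100 / (user_sys_time + stalled_time), 2)

def pct_kswapd_awake(awake_time, elapsed_time):
    return round(awake_time * 100 / elapsed_time, 2)

# traceonly-v5r6 column: 32.79s stalled, 663.3s user/sys -> 4.71%
print(pct_direct_reclaim(663.3, 32.79))
# flushforward-v5r9 column: 7.01s stalled, 680.53s user/sys -> 1.02%
print(pct_direct_reclaim(680.53, 7.01))
# kswapd awake: 2192.03s of 6703.22s elapsed -> 32.7%
print(pct_kswapd_awake(2192.03, 6703.22))
```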

> > Flush oldest actually increased the number of pages written back by
> > kswapd but the anon writeback is also high as swap is involved. Kicking
> > flusher threads also helps a lot. It helps less than the previous release
> > because I noticed I was kicking flusher threads for both anon and file
> > dirty pages which is cheating. It's now only waking the threads for
> > file. It's still a reduction of 84% overall so nothing to sneeze at.
> > 
> > What the patch did do was reduce time stalled in direct reclaim and time
> > kswapd spent awake so it still might be going the right direction. I
> > don't have a feeling for how much the writeback figures change between
> > runs because they take so long to run.
> > 
> > STRESS-HIGHALLOC FTrace Reclaim Statistics
> >                   stress-highalloc      stress-highalloc      stress-highalloc      stress-highalloc
> >                     traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
> > Direct reclaims                               1221       1284       1127       1252 
> > Direct reclaim pages scanned                146220     186156     142075     140617 
> > Direct reclaim write file async I/O           3433          0          0          0 
> > Direct reclaim write anon async I/O          25238      28758      23940      23247 
> > Direct reclaim write file sync I/O            3095          0          0          0 
> > Direct reclaim write anon sync I/O           10911     305579     281824     246251 
> > Wake kswapd requests                          1193       1196       1088       1209 
> > Kswapd wakeups                                 805        824        758        804 
> > Kswapd pages scanned                      30953364   52621368   42722498   30945547 
> > Kswapd reclaim write file async I/O         898087     241135     570467      54319 
> > Kswapd reclaim write anon async I/O        2278607    2201894    1885741    1949170 
> > Kswapd reclaim write file sync I/O               0          0          0          0 
> > Kswapd reclaim write anon sync I/O               0          0          0          0 
> > Time stalled direct reclaim (ms)           8567.29    6628.83    6520.39    6947.23 
> > Time kswapd awake (ms)                     5847.60    3589.43    3900.74   15837.59 
> > 
> > User/Sys Time Running Test (seconds)       2824.76   2833.05   2833.26   2830.46
> > Percentage Time Spent Direct Reclaim         0.25%     0.00%     0.00%     0.00%
> > Total Elapsed Time (seconds)              10920.14   9021.17   8872.06   9301.86
> > Percentage Time kswapd Awake                 0.15%     0.00%     0.00%     0.00%
> > 
> > Same here, the number of pages written back by kswapd increased but
> > again anon writeback was a big factor. Kicking threads when dirty pages
> > are encountered still helps a lot, with a 94% reduction in pages written
> > back overall.
> 
> That is impressive! So it definitely helps to reduce total number of
> dirty pages under memory pressure.
> 

Yes.
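The reductions quoted can be checked against the "Kswapd reclaim write file async I/O" rows of the two tables; a quick sketch of the arithmetic (counts copied from the sysbench v5r7 and stress-highalloc tables):

```python
def reduction_pct(before, after):
    """Percentage reduction in kswapd file writeback between two runs."""
    return (before - after) * 100 / before

# sysbench: traceonly 888082 -> flushforward 147903 (the ~84% figure)
print(round(reduction_pct(888082, 147903)))   # ~83
# stress-highalloc: traceonly 898087 -> flushforward 54319 (the 94% figure)
print(round(reduction_pct(898087, 54319)))    # 94
```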

> > Also, your patch really helped the time spent stalled by direct reclaim
> > and kswapd was awake a lot less, with tests completing far faster.
> 
> Thanks. So it does improve the dirty page layout in the LRU lists.
> 

It would appear to.

> > Overall, I still think your series is a big help (although I don't know if
> > the patches in linux-next are also making a difference), but it's not actually
> > reducing the pages encountered by direct reclaim. Maybe that is because
> > the tests were making more forward progress and so scanning faster. The
> > sysbench performance results are too varied to draw conclusions from but it
> > did slightly improve the success rate of high-order allocations.
> > 
> > The flush-forward patches would appear to be a requirement. Christoph
> > first described them as a band-aid but he didn't chuck rocks at me when
> > the patch was actually released. Right now, I'm leaning towards pushing
> > it and judging by the Swear Meter how good/bad others think it is. So far
> > it's: me pro, Rik pro, Christoph maybe.
> 
> Sorry for the delay, I'll help review it.
> 

Don't be sorry, I still haven't reviewed the writeback patches.

> > > For a large dirty inode, it may flush lots of newly dirtied pages _after_
> > > syncing the expired pages. This is the normal case for a single-stream
> > > sequential dirtier, where older pages are in lower offsets.  In this case we
> > > shall not insist on syncing the whole large dirty inode before considering the
> > > other small dirty inodes. This risks wasting time syncing 1GB freshly dirtied
> > > pages before syncing the other N*1MB expired dirty pages which are approaching
> > > the end of the LRU list and hence pageout().
> > > 
> > 
> > Intuitively, this makes a lot of sense.
> > 
> > > For a large dirty inode, it may also flush lots of newly dirtied pages _before_
> > > hitting the desired old ones, in which case it helps for pageout() to do some
> > > clustered writeback, and/or set mapping->writeback_index to help the flusher
> > > focus on old pages.
> > > 
> > 
> > Will put this idea on the maybe pile.
> > 
> > > For a large dirty inode, it may also have intermixed old and new dirty pages.
> > > In this case we need to make sure the inode is queued for IO before some of
> > > its pages hit pageout(). Adaptive dirty expire time helps here.
> > > 
> > > OK, end of the vapour ideas. As for this patchset, it fixes the current
> > > kupdate/background writeback priority:
> > > 
> > > - the kupdate/background writeback shall include newly expired inodes at each
> > >   queue_io() time, as the large inodes left over from previous writeback rounds
> > >   are likely to have a lower density of old pages.
> > > 
> > > - the background writeback shall consider expired inodes first, just like the
> > >   kupdate writeback
> > > 
> > 
> > I haven't actually reviewed these. I got testing kicked off first
> > because it didn't require brains :)
> 
> Thanks all the same!
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-26 10:44     ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-26 10:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:29PM +0800, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Can't see any problem.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-26 10:52     ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-26 10:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:30PM +0800, Wu Fengguang wrote:
> Dynamicly compute the dirty expire timestamp at queue_io() time.
> Also remove writeback_control.older_than_this which is no longer used.
> 
> writeback_control.older_than_this used to be determined at entrance to
> the kupdate writeback work. This _static_ timestamp may go stale if the
> kupdate work runs on and on. The flusher may then stuck with some old
> busy inodes, never considering newly expired inodes thereafter.
> 
> This has two possible problems:
> 
> - It is unfair for a large dirty inode to delay (for a long time) the
>   writeback of small dirty inodes.
> 
> - As time goes by, the large and busy dirty inode may contain only
>   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
>   delaying the expired dirty pages to the end of LRU lists, triggering
>   the very bad pageout(). Neverthless this patch merely addresses part
>   of the problem.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Again, makes sense and I can't see a problem. There are some
wordsmithing issues in the changelog such as Dynamicly -> Dynamically and
s/writeback_control.older_than_this used/writeback_control.older_than_this is used/
but other than that it looks fine.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-26 10:53     ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-26 10:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:31PM +0800, Wu Fengguang wrote:
> When wbc.more_io was first introduced, it indicates whether there are
> at least one superblock whose s_more_io contains more IO work. Now with
> the per-bdi writeback, it can be replaced with a simple b_more_io test.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-26 10:57     ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-26 10:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is
> - enqueue all newly expired inodes at each queue_io() time
> - retry with halfed expire interval until get some inodes to sync
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Ok, intuitively this would appear to tie into pageout where we want
older inodes to be cleaned first by background flushers to limit the
number of dirty pages encountered by page reclaim. If this is accurate,
it should be detailed in the changelog.

> ---
>  fs/fs-writeback.c |   20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
>  				struct writeback_control *wbc)
>  {
>  	unsigned long expire_interval = 0;
> -	unsigned long older_than_this;
> +	unsigned long older_than_this = 0; /* reset to kill gcc warning */
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
>  	struct inode *inode;
>  	int do_sb_sort = 0;
>  
> -	if (wbc->for_kupdate) {
> +	if (wbc->for_kupdate || wbc->for_background) {
>  		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
>  		older_than_this = jiffies - expire_interval;
>  	}
> @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
>  		if (expire_interval &&
> -		    inode_dirtied_after(inode, older_than_this))
> -			break;
> +		    inode_dirtied_after(inode, older_than_this)) {
> +			if (wbc->for_background &&
> +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> +				expire_interval >>= 1;
> +				older_than_this = jiffies - expire_interval;
> +				continue;
> +			} else
> +				break;
> +		}

This needs a comment.

I think what it is saying is that if background flush is active but no
inodes are old enough, consider newer inodes. This is on the assumption
that page reclaim has encountered dirty pages and the dirty inodes are
still too young.

>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
>  		sb = inode->i_sb;
> @@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
>  
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +
> +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
> @@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
>  
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  	writeback_sb_inodes(sb, wb, wbc, true);
>  	spin_unlock(&inode_lock);
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-26 11:01     ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-26 11:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:33PM +0800, Wu Fengguang wrote:
> writeback_inodes_wb()/__writeback_inodes_sb() are not agressive in that
> they only populate b_io when necessary at entrance time. When the queued
> set of inodes are all synced, they just return, possibly with
> wbc.nr_to_write > 0.
> 
> For kupdate and background writeback, there may be more eligible inodes
> sitting in b_dirty when the current set of b_io inodes are completed. So
> it is necessary to try another round of writeback as long as we made some
> progress in this round. When there are no more eligible inodes, no more
> inodes will be enqueued in queue_io(), hence nothing could/will be
> synced and we may safely bail.
> 
> This will livelock sync when there are heavy dirtiers. However in that case
> sync will already be livelocked w/o this patch, as the current livelock
> avoidance code is virtually a no-op (for one thing, wb_time should be
> set statically at sync start time and be used in move_expired_inodes()).
> The sync livelock problem will be addressed in other patches.
> 

There does seem to be a livelock issue. During iozone, I see messages in
the console log with this series applied that look like

[ 1687.132034] INFO: task iozone:21225 blocked for more than 120 seconds.
[ 1687.211425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1687.305204] iozone        D ffff880001b13640     0 21225  21108 0x00000000
[ 1687.387677]  ffff880037419d48 0000000000000082 0000000000000348 0000000000013640
[ 1687.476594]  ffff880037419fd8 ffff880037419fd8 ffff880065892da0 0000000000013640
[ 1687.565512]  0000000000013640 0000000000013640 ffff880065892da0 ffff88007f411510
[ 1687.654431] Call Trace:
[ 1687.683663]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
[ 1687.747204]  [<ffffffff812d8f67>] schedule_timeout+0x2d/0x214
[ 1687.815947]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
[ 1687.879489]  [<ffffffff812d8527>] wait_for_common+0xd2/0x14a
[ 1687.947195]  [<ffffffff8103ef1e>] ? default_wake_function+0x0/0x14
[ 1688.021132]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
[ 1688.084680]  [<ffffffff811160f0>] ? sync_one_sb+0x0/0x22
[ 1688.148223]  [<ffffffff812d8657>] wait_for_completion+0x1d/0x1f
[ 1688.219051]  [<ffffffff811121c4>] sync_inodes_sb+0x92/0x14c
[ 1688.285710]  [<ffffffff811160f0>] ? sync_one_sb+0x0/0x22
[ 1688.349249]  [<ffffffff811160b9>] __sync_filesystem+0x4c/0x83
[ 1688.417995]  [<ffffffff81116110>] sync_one_sb+0x20/0x22
[ 1688.480505]  [<ffffffff810f6a23>] iterate_supers+0x66/0xa4
[ 1688.546124]  [<ffffffff81116157>] sys_sync+0x45/0x5c
[ 1688.605509]  [<ffffffff81002c72>] system_call_fastpath+0x16/0x1b

Similar messages do not appear without the patch. iozone does complete though
and the performance figures are not affected. Should I be worried?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 6/6] writeback: introduce writeback_control.inodes_written
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-07-26 11:04     ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-26 11:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:34PM +0800, Wu Fengguang wrote:
> Introduce writeback_control.inodes_written to count successful
> ->write_inode() calls.  A non-zero value means there are some
> progress on writeback, in which case more writeback will be tried.
> 
> This prevents aborting a background writeback work prematually when
> the current set of inodes for IO happen to be metadata-only dirty.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Seems reasonable.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-26 10:52     ` Mel Gorman
@ 2010-07-26 11:32       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 06:52:00PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 01:09:30PM +0800, Wu Fengguang wrote:
> > Dynamicly compute the dirty expire timestamp at queue_io() time.
> > Also remove writeback_control.older_than_this which is no longer used.
> > 
> > writeback_control.older_than_this used to be determined at entrance to
> > the kupdate writeback work. This _static_ timestamp may go stale if the
> > kupdate work runs on and on. The flusher may then stuck with some old
> > busy inodes, never considering newly expired inodes thereafter.
> > 
> > This has two possible problems:
> > 
> > - It is unfair for a large dirty inode to delay (for a long time) the
> >   writeback of small dirty inodes.
> > 
> > - As time goes by, the large and busy dirty inode may contain only
> >   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
> >   delaying the expired dirty pages to the end of LRU lists, triggering
> >   the very bad pageout(). Neverthless this patch merely addresses part
> >   of the problem.
> > 
> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> Again, makes sense and I can't see a problem. There are some worth
> smithing issues in the changelog such as Dynamicly -> Dynamically and

Hah forgot to enable spell checking.

> s/writeback_control.older_than_this used/writeback_control.older_than_this is used/

It's "used to", my god.

> but other than that.
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-26 11:01     ` Mel Gorman
@ 2010-07-26 11:39       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 07:01:25PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 01:09:33PM +0800, Wu Fengguang wrote:
> > writeback_inodes_wb()/__writeback_inodes_sb() are not agressive in that
> > they only populate b_io when necessary at entrance time. When the queued
> > set of inodes are all synced, they just return, possibly with
> > wbc.nr_to_write > 0.
> > 
> > For kupdate and background writeback, there may be more eligible inodes
> > sitting in b_dirty when the current set of b_io inodes are completed. So
> > it is necessary to try another round of writeback as long as we made some
> > progress in this round. When there are no more eligible inodes, no more
> > inodes will be enqueued in queue_io(), hence nothing could/will be
> > synced and we may safely bail.
> > 
> > This will livelock sync when there are heavy dirtiers. However in that case
> > sync will already be livelocked w/o this patch, as the current livelock
> > avoidance code is virtually a no-op (for one thing, wb_time should be
> > set statically at sync start time and be used in move_expired_inodes()).
> > The sync livelock problem will be addressed in other patches.
> > 
> 
> There does seem to be a livelock issue. During iozone, I see messages in
> the console log with this series applied that look like
> 
> [ 1687.132034] INFO: task iozone:21225 blocked for more than 120 seconds.
> [ 1687.211425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1687.305204] iozone        D ffff880001b13640     0 21225  21108 0x00000000
> [ 1687.387677]  ffff880037419d48 0000000000000082 0000000000000348 0000000000013640
> [ 1687.476594]  ffff880037419fd8 ffff880037419fd8 ffff880065892da0 0000000000013640
> [ 1687.565512]  0000000000013640 0000000000013640 ffff880065892da0 ffff88007f411510
> [ 1687.654431] Call Trace:
> [ 1687.683663]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
> [ 1687.747204]  [<ffffffff812d8f67>] schedule_timeout+0x2d/0x214
> [ 1687.815947]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
> [ 1687.879489]  [<ffffffff812d8527>] wait_for_common+0xd2/0x14a
> [ 1687.947195]  [<ffffffff8103ef1e>] ? default_wake_function+0x0/0x14
> [ 1688.021132]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
> [ 1688.084680]  [<ffffffff811160f0>] ? sync_one_sb+0x0/0x22
> [ 1688.148223]  [<ffffffff812d8657>] wait_for_completion+0x1d/0x1f
> [ 1688.219051]  [<ffffffff811121c4>] sync_inodes_sb+0x92/0x14c
> [ 1688.285710]  [<ffffffff811160f0>] ? sync_one_sb+0x0/0x22
> [ 1688.349249]  [<ffffffff811160b9>] __sync_filesystem+0x4c/0x83
> [ 1688.417995]  [<ffffffff81116110>] sync_one_sb+0x20/0x22
> [ 1688.480505]  [<ffffffff810f6a23>] iterate_supers+0x66/0xa4
> [ 1688.546124]  [<ffffffff81116157>] sys_sync+0x45/0x5c
> [ 1688.605509]  [<ffffffff81002c72>] system_call_fastpath+0x16/0x1b
> 
> Similar messages do not appear without the patch. iozone does complete though
> and the performance figures are not affected. Should I be worried?

The patch does add a bit more livelock possibility. But don't worry,
I'll fix that.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-26 10:28   ` Itaru Kitayama
@ 2010-07-26 11:47     ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:47 UTC (permalink / raw)
  To: Itaru Kitayama
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

> Here's a touch up patch on top of your changes against the latest
> mmotm.
>
> Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>

Applied, Thanks!

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-23 18:15     ` Jan Kara
@ 2010-07-26 11:51       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Sat, Jul 24, 2010 at 02:15:21AM +0800, Jan Kara wrote:
> On Thu 22-07-10 13:09:32, Wu Fengguang wrote:
> > A background flush work may run for ever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> > 
> > The policy is
> > - enqueue all newly expired inodes at each queue_io() time
> > - retry with halfed expire interval until get some inodes to sync
>   Hmm, this logic looks a bit arbitrary to me. What I actually don't like
> very much about this that when there aren't inodes older than say 2
> seconds, you'll end up queueing just inodes between 2s and 1s. So I'd
> rather just queue inodes older than the limit and if there are none, just
> queue all other dirty inodes.

You are proposing

-				expire_interval >>= 1;
+				expire_interval = 0;

IMO this does not really simplify the code or the concept. If we can get
the "smoother" behavior of the original patch without extra cost, why not?

Thanks,
Fengguang


> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |   20 ++++++++++++++------
> >  1 file changed, 14 insertions(+), 6 deletions(-)
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> > @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
> >  				struct writeback_control *wbc)
> >  {
> >  	unsigned long expire_interval = 0;
> > -	unsigned long older_than_this;
> > +	unsigned long older_than_this = 0; /* reset to kill gcc warning */
> >  	LIST_HEAD(tmp);
> >  	struct list_head *pos, *node;
> >  	struct super_block *sb = NULL;
> >  	struct inode *inode;
> >  	int do_sb_sort = 0;
> >  
> > -	if (wbc->for_kupdate) {
> > +	if (wbc->for_kupdate || wbc->for_background) {
> >  		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
> >  		older_than_this = jiffies - expire_interval;
> >  	}
> > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> >  	while (!list_empty(delaying_queue)) {
> >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> >  		if (expire_interval &&
> > -		    inode_dirtied_after(inode, older_than_this))
> > -			break;
> > +		    inode_dirtied_after(inode, older_than_this)) {
> > +			if (wbc->for_background &&
> > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > +				expire_interval >>= 1;
> > +				older_than_this = jiffies - expire_interval;
> > +				continue;
> > +			} else
> > +				break;
> > +		}
> >  		if (sb && sb != inode->i_sb)
> >  			do_sb_sort = 1;
> >  		sb = inode->i_sb;
> > @@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
> >  
> >  	wbc->wb_start = jiffies; /* livelock avoidance */
> >  	spin_lock(&inode_lock);
> > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > +
> > +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> >  		queue_io(wb, wbc);
> >  
> >  	while (!list_empty(&wb->b_io)) {
> > @@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
> >  
> >  	wbc->wb_start = jiffies; /* livelock avoidance */
> >  	spin_lock(&inode_lock);
> > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> >  		queue_io(wb, wbc);
> >  	writeback_sb_inodes(sb, wb, wbc, true);
> >  	spin_unlock(&inode_lock);
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 10:57     ` Mel Gorman
@ 2010-07-26 12:00       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > A background flush work may run for ever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> > 
> > The policy is
> > - enqueue all newly expired inodes at each queue_io() time
> > - retry with halfed expire interval until get some inodes to sync
> > 
> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> Ok, intuitively this would appear to tie into pageout where we want
> older inodes to be cleaned first by background flushers to limit the
> number of dirty pages encountered by page reclaim. If this is accurate,
> it should be detailed in the changelog.

Good suggestion. I'll add these lines:

This is to help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So syncing
older inodes first helps reduce the number of dirty pages reached by
the page reclaim code.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 11:51       ` Wu Fengguang
@ 2010-07-26 12:12         ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2010-07-26 12:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Dave Chinner, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Mon 26-07-10 19:51:53, Wu Fengguang wrote:
> On Sat, Jul 24, 2010 at 02:15:21AM +0800, Jan Kara wrote:
> > On Thu 22-07-10 13:09:32, Wu Fengguang wrote:
> > > A background flush work may run for ever. So it's reasonable for it to
> > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > 
> > > The policy is
> > > - enqueue all newly expired inodes at each queue_io() time
> > > - retry with halfed expire interval until get some inodes to sync
> >   Hmm, this logic looks a bit arbitrary to me. What I actually don't like
> > very much about this that when there aren't inodes older than say 2
> > seconds, you'll end up queueing just inodes between 2s and 1s. So I'd
> > rather just queue inodes older than the limit and if there are none, just
> > queue all other dirty inodes.
> 
> You are proposing
> 
> -				expire_interval >>= 1;
> +				expire_interval = 0;
> 
> IMO this does not really simplify code or concept. If we can get the
> "smoother" behavior in original patch without extra cost, why not? 
  I agree there's no substantial code simplification. But I see a
substantial "behavior" simplification (just two sweeps instead of 10 or
so). But I don't really insist on the two sweeps; it's just that I don't
see a justification for the exponential backoff here... I mean, what's
the point if the interval we queue gets really small? Why not just use
expire_interval/2 as a step if you want smoother behavior?

								Honza
> > > CC: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  fs/fs-writeback.c |   20 ++++++++++++++------
> > >  1 file changed, 14 insertions(+), 6 deletions(-)
> > > 
> > > --- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> > > +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> > > @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
> > >  				struct writeback_control *wbc)
> > >  {
> > >  	unsigned long expire_interval = 0;
> > > -	unsigned long older_than_this;
> > > +	unsigned long older_than_this = 0; /* reset to kill gcc warning */
> > >  	LIST_HEAD(tmp);
> > >  	struct list_head *pos, *node;
> > >  	struct super_block *sb = NULL;
> > >  	struct inode *inode;
> > >  	int do_sb_sort = 0;
> > >  
> > > -	if (wbc->for_kupdate) {
> > > +	if (wbc->for_kupdate || wbc->for_background) {
> > >  		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
> > >  		older_than_this = jiffies - expire_interval;
> > >  	}
> > > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> > >  	while (!list_empty(delaying_queue)) {
> > >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > >  		if (expire_interval &&
> > > -		    inode_dirtied_after(inode, older_than_this))
> > > -			break;
> > > +		    inode_dirtied_after(inode, older_than_this)) {
> > > +			if (wbc->for_background &&
> > > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > > +				expire_interval >>= 1;
> > > +				older_than_this = jiffies - expire_interval;
> > > +				continue;
> > > +			} else
> > > +				break;
> > > +		}
> > >  		if (sb && sb != inode->i_sb)
> > >  			do_sb_sort = 1;
> > >  		sb = inode->i_sb;
> > > @@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
> > >  
> > >  	wbc->wb_start = jiffies; /* livelock avoidance */
> > >  	spin_lock(&inode_lock);
> > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > +
> > > +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> > >  		queue_io(wb, wbc);
> > >  
> > >  	while (!list_empty(&wb->b_io)) {
> > > @@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
> > >  
> > >  	wbc->wb_start = jiffies; /* livelock avoidance */
> > >  	spin_lock(&inode_lock);
> > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> > >  		queue_io(wb, wbc);
> > >  	writeback_sb_inodes(sb, wb, wbc, true);
> > >  	spin_unlock(&inode_lock);
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:00       ` Wu Fengguang
@ 2010-07-26 12:20         ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2010-07-26 12:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Mel Gorman, Andrew Morton, Dave Chinner, Jan Kara,
	Christoph Hellwig, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Mon 26-07-10 20:00:11, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> > On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > > A background flush work may run for ever. So it's reasonable for it to
> > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > 
> > > The policy is
> > > - enqueue all newly expired inodes at each queue_io() time
> > > - retry with halfed expire interval until get some inodes to sync
> > > 
> > > CC: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > 
> > Ok, intuitively this would appear to tie into pageout where we want
> > older inodes to be cleaned first by background flushers to limit the
> > number of dirty pages encountered by page reclaim. If this is accurate,
> > it should be detailed in the changelog.
> 
> Good suggestion. I'll add these lines:
> 
> This is to help reduce the number of dirty pages encountered by page
> reclaim, eg. the pageout() calls. Normally older inodes contain older
> dirty pages, which are more close to the end of the LRU lists. So
  Well, this kind of implicitly assumes that once a page is written, it
doesn't get accessed anymore, right? Which I imagine is often true but
not for all workloads... Anyway, I think this behavior is a good start,
also because it is kind of natural for users to see "old" files written
first.

> syncing older inodes first helps reducing the dirty pages reached by
> the page reclaim code.

								Honza  
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:12         ` Jan Kara
@ 2010-07-26 12:29           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 08:12:59PM +0800, Jan Kara wrote:
> On Mon 26-07-10 19:51:53, Wu Fengguang wrote:
> > On Sat, Jul 24, 2010 at 02:15:21AM +0800, Jan Kara wrote:
> > > On Thu 22-07-10 13:09:32, Wu Fengguang wrote:
> > > > A background flush work may run for ever. So it's reasonable for it to
> > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > 
> > > > The policy is
> > > > - enqueue all newly expired inodes at each queue_io() time
> > > > - retry with halfed expire interval until get some inodes to sync
> > >   Hmm, this logic looks a bit arbitrary to me. What I actually don't like
> > > very much about this that when there aren't inodes older than say 2
> > > seconds, you'll end up queueing just inodes between 2s and 1s. So I'd
> > > rather just queue inodes older than the limit and if there are none, just
> > > queue all other dirty inodes.
> > 
> > You are proposing
> > 
> > -				expire_interval >>= 1;
> > +				expire_interval = 0;
> > 
> > IMO this does not really simplify code or concept. If we can get the
> > "smoother" behavior in original patch without extra cost, why not? 
>   I agree there's no substantial code simplification. But I see a
> substantial "behavior" simplification (just two sweeps instead of 10 or
> so). But I don't really insist on the two sweeps, it's just that I don't
> see a justification for the exponencial back off here... I mean what's the
> point if the interval we queue gets really small? Why not just use
> expire_interval/2 as a step if you want a smoother behavior?

Yeah, the _non-linear_ backoff is not good. You have a point about the
behavior simplification, and it does remove one line. So I'll do it
your way.

Thanks,
Fengguang
---
Subject: writeback: sync expired inodes first in background writeback
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Wed Jul 21 20:11:53 CST 2010

A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- enqueue all dirty inodes if there are no more expired inodes to sync

This will help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So syncing
older inodes first helps reduce the number of dirty pages reached by
the page reclaim code.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-26 20:19:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-26 20:25:01.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,14 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
-			break;
+		    inode_dirtied_after(inode, older_than_this)) {
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval = 0;
+				continue;
+			} else
+				break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -521,7 +527,8 @@ void writeback_inodes_wb(struct bdi_writ
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -550,7 +557,7 @@ static void __writeback_inodes_sb(struct
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);
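
The queueing policy the patch implements can be modeled in userspace. The sketch below is illustrative only: `fake_inode`, `queue_for_io` and the flat-array representation are inventions for the example, not kernel structures, and the real move_expired_inodes() additionally sorts by superblock.

```c
#include <assert.h>
#include <stddef.h>

struct fake_inode {
	unsigned long dirtied_when;	/* timestamp the inode got dirty */
};

/*
 * Model of the patch's policy: pick inodes older than the expire
 * cutoff; for background writeback, if nothing has expired yet, drop
 * the cutoff (expire_interval = 0) and take everything instead.
 * Returns the number of picked inodes, with their indices in *picked.
 */
static size_t queue_for_io(const struct fake_inode *dirty, size_t n,
			   unsigned long now, unsigned long expire_interval,
			   int for_background, size_t *picked)
{
	size_t count = 0;

retry:
	for (size_t i = 0; i < n; i++)
		if (!expire_interval ||
		    now - dirty[i].dirtied_when >= expire_interval)
			picked[count++] = i;
	/* background writeback falls back to all dirty inodes */
	if (for_background && !count && expire_interval) {
		expire_interval = 0;
		goto retry;
	}
	return count;
}
```

With a 300-unit cutoff only the old inode is picked; when no inode is old enough, background writeback takes all of them rather than returning empty-handed.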

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:20         ` Jan Kara
@ 2010-07-26 12:31           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mel Gorman, Andrew Morton, Dave Chinner, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 08:20:54PM +0800, Jan Kara wrote:
> On Mon 26-07-10 20:00:11, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> > > On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > > > A background flush work may run for ever. So it's reasonable for it to
> > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > 
> > > > The policy is
> > > > - enqueue all newly expired inodes at each queue_io() time
> > > > - retry with halfed expire interval until get some inodes to sync
> > > > 
> > > > CC: Jan Kara <jack@suse.cz>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > 
> > > Ok, intuitively this would appear to tie into pageout where we want
> > > older inodes to be cleaned first by background flushers to limit the
> > > number of dirty pages encountered by page reclaim. If this is accurate,
> > > it should be detailed in the changelog.
> > 
> > Good suggestion. I'll add these lines:
> > 
> > This is to help reduce the number of dirty pages encountered by page
> > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > dirty pages, which are more close to the end of the LRU lists. So
>   Well, this kind of implicitely assumes that once page is written, it
> doesn't get accessed anymore, right?

No, this patch is not evicting the page :)

> Which I imagine is often true but
> not for all workloads... Anyway I think this behavior is a good start
> also because it is kind of natural to users to see "old" files written
> first.

Thanks,
Fengguang

> > syncing older inodes first helps reducing the dirty pages reached by
> > the page reclaim code.
> 
> 								Honza  
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:31           ` Wu Fengguang
@ 2010-07-26 12:39             ` Jan Kara
  -1 siblings, 0 replies; 98+ messages in thread
From: Jan Kara @ 2010-07-26 12:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Mel Gorman, Andrew Morton, Dave Chinner,
	Christoph Hellwig, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Mon 26-07-10 20:31:41, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 08:20:54PM +0800, Jan Kara wrote:
> > On Mon 26-07-10 20:00:11, Wu Fengguang wrote:
> > > On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> > > > On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > > > > A background flush work may run for ever. So it's reasonable for it to
> > > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > > 
> > > > > The policy is
> > > > > - enqueue all newly expired inodes at each queue_io() time
> > > > > - retry with halfed expire interval until get some inodes to sync
> > > > > 
> > > > > CC: Jan Kara <jack@suse.cz>
> > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > 
> > > > Ok, intuitively this would appear to tie into pageout where we want
> > > > older inodes to be cleaned first by background flushers to limit the
> > > > number of dirty pages encountered by page reclaim. If this is accurate,
> > > > it should be detailed in the changelog.
> > > 
> > > Good suggestion. I'll add these lines:
> > > 
> > > This is to help reduce the number of dirty pages encountered by page
> > > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > > dirty pages, which are more close to the end of the LRU lists. So
> >   Well, this kind of implicitely assumes that once page is written, it
> > doesn't get accessed anymore, right?
> 
> No, this patch is not evicting the page :)
  Sorry, I probably wasn't clear enough :) I meant: the claim that "older
inodes contain older dirty pages, which are closer to the end of the
LRU lists" assumes that once a page is written it doesn't get accessed
again. For example, files which get continual random access (like DB files)
can have a rather old dirtied_when while some of their pages are accessed
quite often...

> > Which I imagine is often true but
> > not for all workloads... Anyway I think this behavior is a good start
> > also because it is kind of natural to users to see "old" files written
> > first.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-23 17:39     ` Jan Kara
@ 2010-07-26 12:39       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Sat, Jul 24, 2010 at 01:39:54AM +0800, Jan Kara wrote:
> On Thu 22-07-10 13:09:33, Wu Fengguang wrote:
> > writeback_inodes_wb()/__writeback_inodes_sb() are not agressive in that
> > they only populate b_io when necessary at entrance time. When the queued
> > set of inodes are all synced, they just return, possibly with
> > wbc.nr_to_write > 0.
> > 
> > For kupdate and background writeback, there may be more eligible inodes
> > sitting in b_dirty when the current set of b_io inodes are completed. So
> > it is necessary to try another round of writeback as long as we made some
> > progress in this round. When there are no more eligible inodes, no more
> > inodes will be enqueued in queue_io(), hence nothing could/will be
> > synced and we may safely bail.
> > 
> > This will livelock sync when there are heavy dirtiers. However in that case
> > sync will already be livelocked w/o this patch, as the current livelock
> > avoidance code is virtually a no-op (for one thing, wb_time should be
> > set statically at sync start time and be used in move_expired_inodes()).
> > The sync livelock problem will be addressed in other patches.
>   Hmm, any reason why you don't solve this problem by just removing the
> condition before queue_io()? It would also make the logic simpler - always

Yeah, I'll remove the check before queue_io() in the coming sync livelock
patchset. For now this patchset does the below. Though awkward, it avoids
unnecessary behavior changes for the non-background cases.

-       if (!wbc->for_kupdate || list_empty(&wb->b_io))
+       if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
                queue_io(wb, wbc);


> queue all inodes that are eligible for writeback...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:39             ` Jan Kara
@ 2010-07-26 12:47               ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mel Gorman, Andrew Morton, Dave Chinner, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 08:39:07PM +0800, Jan Kara wrote:
> On Mon 26-07-10 20:31:41, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 08:20:54PM +0800, Jan Kara wrote:
> > > On Mon 26-07-10 20:00:11, Wu Fengguang wrote:
> > > > On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> > > > > On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > > > > > A background flush work may run for ever. So it's reasonable for it to
> > > > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > > > 
> > > > > > The policy is
> > > > > > - enqueue all newly expired inodes at each queue_io() time
> > > > > > - retry with halfed expire interval until get some inodes to sync
> > > > > > 
> > > > > > CC: Jan Kara <jack@suse.cz>
> > > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > 
> > > > > Ok, intuitively this would appear to tie into pageout where we want
> > > > > older inodes to be cleaned first by background flushers to limit the
> > > > > number of dirty pages encountered by page reclaim. If this is accurate,
> > > > > it should be detailed in the changelog.
> > > > 
> > > > Good suggestion. I'll add these lines:
> > > > 
> > > > This is to help reduce the number of dirty pages encountered by page
> > > > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > > > dirty pages, which are more close to the end of the LRU lists. So
> > >   Well, this kind of implicitely assumes that once page is written, it
> > > doesn't get accessed anymore, right?
> > 
> > No, this patch is not evicting the page :)
>   Sorry, I probably wasn't clear enough :) I meant: The claim that "older
> inodes contain older dirty pages, which are more close to the end of the
> LRU lists" assumes that once page is written it doesn't get accessed
> again. For example files which get continual random access (like DB files)
> can have rather old dirtied_when but some of their pages are accessed quite
> often...

Ah yes. That leads to another fact: smaller inodes tend to have a
stronger correlation between inode dirty age and page dirty age.

This is one of the reasons not to sync a huge dirty inode in one shot.
Instead of

        sync  1G for inode A
        sync 10M for inode B
        sync 10M for inode C
        sync 10M for inode D

It's better to

        sync 128M for inode A
        sync  10M for inode B
        sync  10M for inode C
        sync  10M for inode D
        sync 128M for inode A
        sync 128M for inode A
        sync 128M for inode A
        sync  10M for inode E (newly expired)
        sync 128M for inode A
        ...
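
The interleaving above can be simulated with a toy scheduler. This is a sketch under stated assumptions: the fixed 128 MB chunk per visit comes from the example above, not from a kernel constant, and `struct job`/`drain_round_robin` are invented names.

```c
#include <assert.h>

#define CHUNK_MB 128	/* per-visit quota for a large inode (example) */

struct job {
	const char *name;
	unsigned int dirty_mb;	/* dirty data left to write, in MB */
};

/*
 * Toy round-robin flusher: each pass over a job writes at most
 * CHUNK_MB, so a 1 GB inode is drained over several visits that are
 * interleaved with the small expired inodes, instead of in one shot.
 * Returns the total number of write passes until everything is clean.
 */
static unsigned int drain_round_robin(struct job *jobs, unsigned int n)
{
	unsigned int passes = 0, remaining = n;

	while (remaining) {
		for (unsigned int i = 0; i < n; i++) {
			unsigned int wrote;

			if (!jobs[i].dirty_mb)
				continue;
			wrote = jobs[i].dirty_mb < CHUNK_MB
					? jobs[i].dirty_mb : CHUNK_MB;
			jobs[i].dirty_mb -= wrote;
			passes++;
			if (!jobs[i].dirty_mb)
				remaining--;
		}
	}
	return passes;
}
```

For the numbers above (one 1 GB inode plus three 10 MB inodes), the small inodes each finish in their first visit while the large one takes eight 128 MB visits, matching the interleaved schedule in the example.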

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 10:57     ` Mel Gorman
@ 2010-07-26 12:56       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

> > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> >  	while (!list_empty(delaying_queue)) {
> >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> >  		if (expire_interval &&
> > -		    inode_dirtied_after(inode, older_than_this))
> > -			break;
> > +		    inode_dirtied_after(inode, older_than_this)) {
> > +			if (wbc->for_background &&
> > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > +				expire_interval >>= 1;
> > +				older_than_this = jiffies - expire_interval;
> > +				continue;
> > +			} else
> > +				break;
> > +		}
> 
> This needs a comment.
> 
> I think what it is saying is that if background flush is active but no
> inodes are old enough, consider newer inodes. This is on the assumption
> that page reclaim has encountered dirty pages and the dirty inodes are
> still too young.

Yes this should be commented. How about this one?

@@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
        while (!list_empty(delaying_queue)) {
                inode = list_entry(delaying_queue->prev, struct inode, i_list);
                if (expire_interval &&
-                   inode_dirtied_after(inode, older_than_this))
+                   inode_dirtied_after(inode, older_than_this)) {
+                       /*
+                        * background writeback will start with expired inodes,
+                        * and then fresh inodes. This order helps reduce the
+                        * number of dirty pages that reach the end of the LRU
+                        * lists and cause trouble for page reclaim.
+                        */
+                       if (wbc->for_background &&
+                           list_empty(dispatch_queue) && list_empty(&tmp)) {
+                               expire_interval = 0;
+                               continue;
+                       }
                        break;
+               }
                if (sb && sb != inode->i_sb)
                        do_sb_sort = 1;
                sb = inode->i_sb;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:56       ` Wu Fengguang
@ 2010-07-26 12:59         ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-26 12:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 08:56:35PM +0800, Wu Fengguang wrote:
> > > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> > >  	while (!list_empty(delaying_queue)) {
> > >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > >  		if (expire_interval &&
> > > -		    inode_dirtied_after(inode, older_than_this))
> > > -			break;
> > > +		    inode_dirtied_after(inode, older_than_this)) {
> > > +			if (wbc->for_background &&
> > > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > > +				expire_interval >>= 1;
> > > +				older_than_this = jiffies - expire_interval;
> > > +				continue;
> > > +			} else
> > > +				break;
> > > +		}
> > 
> > This needs a comment.
> > 
> > I think what it is saying is that if background flush is active but no
> > inodes are old enough, consider newer inodes. This is on the assumption
> > that page reclaim has encountered dirty pages and the dirty inodes are
> > still too young.
> 
> Yes this should be commented. How about this one?
> 
> @@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
>         while (!list_empty(delaying_queue)) {
>                 inode = list_entry(delaying_queue->prev, struct inode, i_list);
>                 if (expire_interval &&
> -                   inode_dirtied_after(inode, older_than_this))
> +                   inode_dirtied_after(inode, older_than_this)) {
> +                       /*
> +                        * background writeback will start with expired inodes,
> +                        * and then fresh inodes. This order helps reducing
> +                        * the number of dirty pages reaching the end of LRU
> +                        * lists and cause trouble to the page reclaim.
> +                        */

s/reducing/reduce/

Otherwise, it's enough detail to know what is going on. Thanks

Thanks

> +                       if (wbc->for_background &&
> +                           list_empty(dispatch_queue) && list_empty(&tmp)) {
> +                               expire_interval = 0;
> +                               continue;
> +                       }
>                         break;
> +               }
>                 if (sb && sb != inode->i_sb)
>                         do_sb_sort = 1;
>                 sb = inode->i_sb;
> 

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:59         ` Mel Gorman
@ 2010-07-26 13:11           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-07-26 13:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 08:59:55PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 08:56:35PM +0800, Wu Fengguang wrote:
> > > > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> > > >  	while (!list_empty(delaying_queue)) {
> > > >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > > >  		if (expire_interval &&
> > > > -		    inode_dirtied_after(inode, older_than_this))
> > > > -			break;
> > > > +		    inode_dirtied_after(inode, older_than_this)) {
> > > > +			if (wbc->for_background &&
> > > > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > > > +				expire_interval >>= 1;
> > > > +				older_than_this = jiffies - expire_interval;
> > > > +				continue;
> > > > +			} else
> > > > +				break;
> > > > +		}
> > > 
> > > This needs a comment.
> > > 
> > > I think what it is saying is that if background flush is active but no
> > > inodes are old enough, consider newer inodes. This is on the assumption
> > > that page reclaim has encountered dirty pages and the dirty inodes are
> > > still too young.
> > 
> > Yes this should be commented. How about this one?
> > 
> > @@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
> >         while (!list_empty(delaying_queue)) {
> >                 inode = list_entry(delaying_queue->prev, struct inode, i_list);
> >                 if (expire_interval &&
> > -                   inode_dirtied_after(inode, older_than_this))
> > +                   inode_dirtied_after(inode, older_than_this)) {
> > +                       /*
> > +                        * background writeback will start with expired inodes,
> > +                        * and then fresh inodes. This order helps reducing
> > +                        * the number of dirty pages reaching the end of LRU
> > +                        * lists and cause trouble to the page reclaim.
> > +                        */
> 
> s/reducing/reduce/
> 
> Otherwise, it's enough detail to know what is going on. Thanks

Thanks. Here is the updated patch.
---
Subject: writeback: sync expired inodes first in background writeback
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Wed Jul 21 20:11:53 CST 2010

A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- enqueue all dirty inodes if there are no more expired inodes to sync

This will help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So syncing
older inodes first helps reduce the number of dirty pages reached by
the page reclaim code.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-26 20:19:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-26 21:10:42.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
+		    inode_dirtied_after(inode, older_than_this)) {
+			/*
+			 * background writeback will start with expired inodes,
+			 * and then fresh inodes. This order helps reduce the
+			 * number of dirty pages reaching the end of the LRU
+			 * lists and causing trouble for page reclaim.
+			 */
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval = 0;
+				continue;
+			}
 			break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -521,7 +533,8 @@ void writeback_inodes_wb(struct bdi_writ
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -550,7 +563,7 @@ static void __writeback_inodes_sb(struct
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 13:11           ` Wu Fengguang
@ 2010-07-27  9:45             ` Mel Gorman
  -1 siblings, 0 replies; 98+ messages in thread
From: Mel Gorman @ 2010-07-27  9:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Mon, Jul 26, 2010 at 09:11:52PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 08:59:55PM +0800, Mel Gorman wrote:
> > On Mon, Jul 26, 2010 at 08:56:35PM +0800, Wu Fengguang wrote:
> > > > > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> > > > >  	while (!list_empty(delaying_queue)) {
> > > > >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > > > >  		if (expire_interval &&
> > > > > -		    inode_dirtied_after(inode, older_than_this))
> > > > > -			break;
> > > > > +		    inode_dirtied_after(inode, older_than_this)) {
> > > > > +			if (wbc->for_background &&
> > > > > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > > > > +				expire_interval >>= 1;
> > > > > +				older_than_this = jiffies - expire_interval;
> > > > > +				continue;
> > > > > +			} else
> > > > > +				break;
> > > > > +		}
> > > > 
> > > > This needs a comment.
> > > > 
> > > > I think what it is saying is that if background flush is active but no
> > > > inodes are old enough, consider newer inodes. This is on the assumption
> > > > that page reclaim has encountered dirty pages and the dirty inodes are
> > > > still too young.
> > > 
> > > Yes this should be commented. How about this one?
> > > 
> > > @@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
> > >         while (!list_empty(delaying_queue)) {
> > >                 inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > >                 if (expire_interval &&
> > > -                   inode_dirtied_after(inode, older_than_this))
> > > +                   inode_dirtied_after(inode, older_than_this)) {
> > > +                       /*
> > > +                        * background writeback will start with expired inodes,
> > > +                        * and then fresh inodes. This order helps reducing
> > > +                        * the number of dirty pages reaching the end of LRU
> > > +                        * lists and cause trouble to the page reclaim.
> > > +                        */
> > 
> > s/reducing/reduce/
> > 
> > Otherwise, it's enough detail to know what is going on. Thanks
> 
> Thanks. Here is the updated patch.
> ---
> Subject: writeback: sync expired inodes first in background writeback
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Wed Jul 21 20:11:53 CST 2010
> 
> A background flush work may run forever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is
> - enqueue all newly expired inodes at each queue_io() time
> - enqueue all dirty inodes if there are no more expired inodes to sync
> 
> This will help reduce the number of dirty pages encountered by page
> reclaim, e.g. the pageout() calls. Normally older inodes contain older
> dirty pages, which are closer to the end of the LRU lists. So syncing
> older inodes first helps reduce the number of dirty pages reached by
> the page reclaim code.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 13:11           ` Wu Fengguang
@ 2010-08-01 15:15             ` Minchan Kim
  -1 siblings, 0 replies; 98+ messages in thread
From: Minchan Kim @ 2010-08-01 15:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Mel Gorman, Andrew Morton, Dave Chinner, Jan Kara,
	Christoph Hellwig, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

Hi Wu, 

> Subject: writeback: sync expired inodes first in background writeback
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Wed Jul 21 20:11:53 CST 2010
> 
> A background flush work may run forever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is
> - enqueue all newly expired inodes at each queue_io() time
> - enqueue all dirty inodes if there are no more expired inodes to sync
> 
> This will help reduce the number of dirty pages encountered by page
> reclaim, e.g. the pageout() calls. Normally older inodes contain older
> dirty pages, which are closer to the end of the LRU lists. So syncing
> older inodes first helps reduce the number of dirty pages reached by
> the page reclaim code.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-26 20:19:01.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-26 21:10:42.000000000 +0800
> @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
>  				struct writeback_control *wbc)
>  {
>  	unsigned long expire_interval = 0;
> -	unsigned long older_than_this;
> +	unsigned long older_than_this = 0; /* reset to kill gcc warning */

Maybe I am rather late. 

Nitpick: 
uninitialized_var() would be more consistent here. :)

I haven't followed up on this patch series, but it is a fundamental way 
to go for reducing pageout. 
-- 
Kind regards,
Minchan Kim

* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-08-01 15:23     ` Minchan Kim
  -1 siblings, 0 replies; 98+ messages in thread
From: Minchan Kim @ 2010-08-01 15:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:29PM +0800, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-08-01 15:29     ` Minchan Kim
  -1 siblings, 0 replies; 98+ messages in thread
From: Minchan Kim @ 2010-08-01 15:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Thu, Jul 22, 2010 at 01:09:30PM +0800, Wu Fengguang wrote:
> Dynamically compute the dirty expire timestamp at queue_io() time.
> Also remove writeback_control.older_than_this which is no longer used.
> 
> writeback_control.older_than_this used to be determined at entrance to
> the kupdate writeback work. This _static_ timestamp may go stale if the
> kupdate work runs on and on. The flusher may then get stuck with some old
> busy inodes, never considering newly expired inodes thereafter.
> 
> This has two possible problems:
> 
> - It is unfair for a large dirty inode to delay (for a long time) the
>   writeback of small dirty inodes.
> 
> - As time goes by, the large and busy dirty inode may contain only
>   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
>   delaying the expired dirty pages to the end of LRU lists, triggering
>   the very bad pageout(). Nevertheless this patch merely addresses part
>   of the problem.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |   24 +++++++++---------------
>  include/linux/writeback.h        |    2 --
>  include/trace/events/writeback.h |    6 +-----
>  mm/backing-dev.c                 |    1 -
>  mm/page-writeback.c              |    1 -
>  5 files changed, 10 insertions(+), 24 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-21 22:20:01.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
> @@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
>  				struct list_head *dispatch_queue,
>  				struct writeback_control *wbc)
>  {
> +	unsigned long expire_interval = 0;
> +	unsigned long older_than_this;
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
>  	struct inode *inode;
>  	int do_sb_sort = 0;
>  
> +	if (wbc->for_kupdate) {
> +		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
> +		older_than_this = jiffies - expire_interval;
> +	}
> +
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> -		if (wbc->older_than_this &&
> -		    inode_dirtied_after(inode, *wbc->older_than_this))
> +		if (expire_interval &&
> +		    inode_dirtied_after(inode, older_than_this))
>  			break;
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
> @@ -583,29 +590,19 @@ static inline bool over_bground_thresh(v
>   * Try to run once per dirty_writeback_interval.  But if a writeback event
>   * takes longer than a dirty_writeback_interval interval, then leave a
>   * one-second gap.
> - *
> - * older_than_this takes precedence over nr_to_write.  So we'll only write back
> - * all dirty pages if they are all attached to "old" mappings.
>   */
>  static long wb_writeback(struct bdi_writeback *wb,
>  			 struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> -		.older_than_this	= NULL,
>  		.for_kupdate		= work->for_kupdate,
>  		.for_background		= work->for_background,
>  		.range_cyclic		= work->range_cyclic,
>  	};
> -	unsigned long oldest_jif;
>  	long wrote = 0;
>  	struct inode *inode;
>  
> -	if (wbc.for_kupdate) {
> -		wbc.older_than_this = &oldest_jif;
> -		oldest_jif = jiffies -
> -				msecs_to_jiffies(dirty_expire_interval * 10);
> -	}
>  	if (!wbc.range_cyclic) {
>  		wbc.range_start = 0;
>  		wbc.range_end = LLONG_MAX;
> @@ -998,9 +995,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   * Write out a superblock's list of dirty inodes.  A wait will be performed
>   * upon no inodes, all inodes or the final one, depending upon sync_mode.
>   *
> - * If older_than_this is non-NULL, then only write out inodes which
> - * had their first dirtying at a time earlier than *older_than_this.
> - *
>   * If `bdi' is non-zero then we're being asked to writeback a specific queue.
>   * This function assumes that the blockdev superblock's inodes are backed by
>   * a variety of queues, so all inodes are searched.  For other superblocks,
> --- linux-next.orig/include/linux/writeback.h	2010-07-21 22:20:02.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
> @@ -28,8 +28,6 @@ enum writeback_sync_modes {
>   */
>  struct writeback_control {
>  	enum writeback_sync_modes sync_mode;
> -	unsigned long *older_than_this;	/* If !NULL, only write back inodes
> -					   older than this */
>  	unsigned long wb_start;         /* Time writeback_inodes_wb was
>  					   called. This is needed to avoid
>  					   extra jobs and livelock */

In addition, we should remove older_than_this in btrfs and reiser4.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-07-22  5:09   ` Wu Fengguang
@ 2010-08-01 15:34     ` Minchan Kim
  -1 siblings, 0 replies; 98+ messages in thread
From: Minchan Kim @ 2010-08-01 15:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:31PM +0800, Wu Fengguang wrote:
> When wbc.more_io was first introduced, it indicated whether at least
> one superblock's s_more_io list contained more IO work. Now with
> the per-bdi writeback, it can be replaced with a simple b_more_io test.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |    9 ++-------
>  include/linux/writeback.h        |    1 -
>  include/trace/events/writeback.h |    5 +----
>  3 files changed, 3 insertions(+), 12 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> @@ -507,12 +507,8 @@ static int writeback_sb_inodes(struct su
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0) {
> -			wbc->more_io = 1;
> +		if (wbc->nr_to_write <= 0)
>  			return 1;
> -		}
> -		if (!list_empty(&wb->b_more_io))
> -			wbc->more_io = 1;
>  	}
>  	/* b_io is empty */
>  	return 1;
> @@ -622,7 +618,6 @@ static long wb_writeback(struct bdi_writ
>  		if (work->for_background && !over_bground_thresh())
>  			break;
>  
> -		wbc.more_io = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
>  
> @@ -644,7 +639,7 @@ static long wb_writeback(struct bdi_writ
>  		/*
>  		 * Didn't write everything and we don't have more IO, bail
>  		 */
> -		if (!wbc.more_io)
> +		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
>  		 * Did we write something? Try for more
> --- linux-next.orig/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-22 11:24:46.000000000 +0800
> @@ -49,7 +49,6 @@ struct writeback_control {
>  	unsigned for_background:1;	/* A background writeback */
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> -	unsigned more_io:1;		/* more io to be dispatched */
>  };
>  
>  /*
> --- linux-next.orig/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2010-07-22 11:24:46.000000000 +0800
> @@ -99,7 +99,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__field(int, for_background)
>  		__field(int, for_reclaim)
>  		__field(int, range_cyclic)
> -		__field(int, more_io)
>  		__field(long, range_start)
>  		__field(long, range_end)
>  	),
> @@ -113,13 +112,12 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background	= wbc->for_background;
>  		__entry->for_reclaim	= wbc->for_reclaim;
>  		__entry->range_cyclic	= wbc->range_cyclic;
> -		__entry->more_io	= wbc->more_io;
>  		__entry->range_start	= (long)wbc->range_start;
>  		__entry->range_end	= (long)wbc->range_end;
>  	),
>  
>  	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> -		"bgrd=%d reclm=%d cyclic=%d more=%d "
> +		"bgrd=%d reclm=%d cyclic=%d "
>  		"start=0x%lx end=0x%lx",
>  		__entry->name,
>  		__entry->nr_to_write,
> @@ -129,7 +127,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background,
>  		__entry->for_reclaim,
>  		__entry->range_cyclic,
> -		__entry->more_io,
>  		__entry->range_start,
>  		__entry->range_end)
>  )
> 
> 
> --

include/trace/events/ext4.h also has a more_io field.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-08-01 15:34     ` Minchan Kim
@ 2010-08-05 14:50       ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-08-05 14:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

> include/trace/events/ext4.h also have more_io field. 

I didn't find it in linux-next. What's your kernel version?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-08-05 14:50       ` Wu Fengguang
@ 2010-08-05 14:55         ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-08-05 14:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Aug 05, 2010 at 10:50:53PM +0800, Wu Fengguang wrote:
> > include/trace/events/ext4.h also have more_io field. 
> 
> I didn't find it in linux-next. What's your kernel version?

Oh it's in mmotm :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-08-05 14:50       ` Wu Fengguang
@ 2010-08-05 14:56         ` Minchan Kim
  -1 siblings, 0 replies; 98+ messages in thread
From: Minchan Kim @ 2010-08-05 14:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Aug 05, 2010 at 10:50:53PM +0800, Wu Fengguang wrote:
> > include/trace/events/ext4.h also have more_io field. 
> 
> I didn't find it in linux-next. What's your kernel version?

I used mmotm-07-29. 

> 
> Thanks,
> Fengguang

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-08-05 14:56         ` Minchan Kim
@ 2010-08-05 15:26           ` Wu Fengguang
  -1 siblings, 0 replies; 98+ messages in thread
From: Wu Fengguang @ 2010-08-05 15:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Aug 05, 2010 at 10:56:06PM +0800, Minchan Kim wrote:
> On Thu, Aug 05, 2010 at 10:50:53PM +0800, Wu Fengguang wrote:
> > > include/trace/events/ext4.h also have more_io field. 
> > 
> > I didn't find it in linux-next. What's your kernel version?
> 
> I used mmotm-07-29. 

Heh it's in linux-next too -- I didn't find the field because the
chunk to remove it slipped into a previous patch.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2010-08-05 15:27 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
2010-07-22  5:09 ` Wu Fengguang
2010-07-22  5:09 ` Wu Fengguang
2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-23 18:16   ` Jan Kara
2010-07-23 18:16     ` Jan Kara
2010-07-26 10:44   ` Mel Gorman
2010-07-26 10:44     ` Mel Gorman
2010-08-01 15:23   ` Minchan Kim
2010-08-01 15:23     ` Minchan Kim
2010-07-22  5:09 ` [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-23 18:17   ` Jan Kara
2010-07-23 18:17     ` Jan Kara
2010-07-26 10:52   ` Mel Gorman
2010-07-26 10:52     ` Mel Gorman
2010-07-26 11:32     ` Wu Fengguang
2010-07-26 11:32       ` Wu Fengguang
2010-08-01 15:29   ` Minchan Kim
2010-08-01 15:29     ` Minchan Kim
2010-07-22  5:09 ` [PATCH 3/6] writeback: kill writeback_control.more_io Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-23 18:24   ` Jan Kara
2010-07-23 18:24     ` Jan Kara
2010-07-26 10:53   ` Mel Gorman
2010-07-26 10:53     ` Mel Gorman
2010-08-01 15:34   ` Minchan Kim
2010-08-01 15:34     ` Minchan Kim
2010-08-05 14:50     ` Wu Fengguang
2010-08-05 14:50       ` Wu Fengguang
2010-08-05 14:55       ` Wu Fengguang
2010-08-05 14:55         ` Wu Fengguang
2010-08-05 14:56       ` Minchan Kim
2010-08-05 14:56         ` Minchan Kim
2010-08-05 15:26         ` Wu Fengguang
2010-08-05 15:26           ` Wu Fengguang
2010-07-22  5:09 ` [PATCH 4/6] writeback: sync expired inodes first in background writeback Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-23 18:15   ` Jan Kara
2010-07-23 18:15     ` Jan Kara
2010-07-26 11:51     ` Wu Fengguang
2010-07-26 11:51       ` Wu Fengguang
2010-07-26 12:12       ` Jan Kara
2010-07-26 12:12         ` Jan Kara
2010-07-26 12:29         ` Wu Fengguang
2010-07-26 12:29           ` Wu Fengguang
2010-07-26 10:57   ` Mel Gorman
2010-07-26 10:57     ` Mel Gorman
2010-07-26 12:00     ` Wu Fengguang
2010-07-26 12:00       ` Wu Fengguang
2010-07-26 12:20       ` Jan Kara
2010-07-26 12:20         ` Jan Kara
2010-07-26 12:31         ` Wu Fengguang
2010-07-26 12:31           ` Wu Fengguang
2010-07-26 12:39           ` Jan Kara
2010-07-26 12:39             ` Jan Kara
2010-07-26 12:47             ` Wu Fengguang
2010-07-26 12:47               ` Wu Fengguang
2010-07-26 12:56     ` Wu Fengguang
2010-07-26 12:56       ` Wu Fengguang
2010-07-26 12:59       ` Mel Gorman
2010-07-26 12:59         ` Mel Gorman
2010-07-26 13:11         ` Wu Fengguang
2010-07-26 13:11           ` Wu Fengguang
2010-07-27  9:45           ` Mel Gorman
2010-07-27  9:45             ` Mel Gorman
2010-08-01 15:15           ` Minchan Kim
2010-08-01 15:15             ` Minchan Kim
2010-07-22  5:09 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-23 17:39   ` Jan Kara
2010-07-23 17:39     ` Jan Kara
2010-07-26 12:39     ` Wu Fengguang
2010-07-26 12:39       ` Wu Fengguang
2010-07-26 11:01   ` Mel Gorman
2010-07-26 11:01     ` Mel Gorman
2010-07-26 11:39     ` Wu Fengguang
2010-07-26 11:39       ` Wu Fengguang
2010-07-22  5:09 ` [PATCH 6/6] writeback: introduce writeback_control.inodes_written Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-22  5:09   ` Wu Fengguang
2010-07-26 11:04   ` Mel Gorman
2010-07-26 11:04     ` Mel Gorman
2010-07-23 10:24 ` [PATCH 0/6] [RFC] writeback: try to write older pages first Mel Gorman
2010-07-23 10:24   ` Mel Gorman
2010-07-26  7:18   ` Wu Fengguang
2010-07-26  7:18     ` Wu Fengguang
2010-07-26 10:42     ` Mel Gorman
2010-07-26 10:42       ` Mel Gorman
2010-07-26 10:28 ` Itaru Kitayama
2010-07-26 10:28   ` Itaru Kitayama
2010-07-26 11:47   ` Wu Fengguang
2010-07-26 11:47     ` Wu Fengguang
