* [PATCH, RFC] vm: Add a tuning knob for vm.max_writeback_pages
From: Theodore Ts'o @ 2009-08-30  2:54 UTC
  To: linux-mm, Ext4 Developers List; +Cc: Theodore Ts'o

MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a concern about
holding I_SYNC for too long.  This shouldn't be a concern any more now
that I_LOCK and I_SYNC have been separated.  So make it a tunable and
change the default to 32768.

This change is helpful for ext4 since it means we write out large
files in bigger chunks than just 4 megabytes (1024 pages of 4k each)
at a time, so that when we have multiple large files in the page cache
waiting for writeback, the files don't end up getting interleaved.
There shouldn't be any downside.

http://bugzilla.kernel.org/show_bug.cgi?id=13930
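
With this applied, the limit shows up as /proc/sys/vm/max_writeback_pages
and can be adjusted like any other vm sysctl, for example:

	# read the current value (defaults to 32768 with this patch)
	cat /proc/sys/vm/max_writeback_pages
	# or set a different value, e.g. for benchmarking
	sysctl -w vm.max_writeback_pages=16384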

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 include/linux/writeback.h |    1 +
 kernel/sysctl.c           |    8 ++++++++
 mm/page-writeback.c       |   27 ++++++++++++++-------------
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3224820..c1e6c08 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -110,6 +110,7 @@ extern int vm_dirty_ratio;
 extern unsigned long vm_dirty_bytes;
 extern unsigned int dirty_writeback_interval;
 extern unsigned int dirty_expire_interval;
+extern unsigned int max_writeback_pages;
 extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58be760..06d1c4c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_writeback_pages",
+		.data		= &max_writeback_pages,
+		.maxlen		= sizeof(max_writeback_pages),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= VM_NR_PDFLUSH_THREADS,
 		.procname	= "nr_pdflush_threads",
 		.data		= &nr_pdflush_threads,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..eac8026 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,15 +36,6 @@
 #include <linux/pagevec.h>
 
 /*
- * The maximum number of pages to writeout in a single bdflush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES	1024
-
-/*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */
@@ -64,6 +55,16 @@ static inline long sync_writeback_pages(void)
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
+ * The maximum number of pages to writeout in a single bdflush/kupdate
+ * operation.  We used to limit this to 1024 pages to avoid holding
+ * I_SYNC against an inode for a long period of time, but since
+ * I_SYNC has been separated out from I_LOCK, the only time a process
+ * waits for I_SYNC is when it is calling fsync() or otherwise forcing
+ * out the inode.
+ */
+unsigned int max_writeback_pages = 32768;
+
+/*
  * Start background writeback (via pdflush) at this percentage
  */
 int dirty_background_ratio = 10;
@@ -708,10 +709,10 @@ static void background_writeout(unsigned long _min_pages)
 			break;
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = max_writeback_pages;
 		wbc.pages_skipped = 0;
 		writeback_inodes(&wbc);
-		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		min_pages -= max_writeback_pages - wbc.nr_to_write;
 		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
 			/* Wrote less than expected */
 			if (wbc.encountered_congestion || wbc.more_io)
@@ -783,7 +784,7 @@ static void wb_kupdate(unsigned long arg)
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = max_writeback_pages;
 		writeback_inodes(&wbc);
 		if (wbc.nr_to_write > 0) {
 			if (wbc.encountered_congestion || wbc.more_io)
@@ -791,7 +792,7 @@ static void wb_kupdate(unsigned long arg)
 			else
 				break;	/* All the old data is written */
 		}
-		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		nr_to_write -= max_writeback_pages - wbc.nr_to_write;
 	}
 	if (time_before(next_jif, jiffies + HZ))
 		next_jif = jiffies + HZ;
-- 
1.6.3.2.1.gb9f7d.dirty



* Re: [PATCH, RFC] vm: Add a tuning knob for vm.max_writeback_pages
From: Christoph Hellwig @ 2009-08-30 16:52 UTC
  To: Theodore Ts'o
  Cc: linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason, jens.axboe

On Sat, Aug 29, 2009 at 10:54:18PM -0400, Theodore Ts'o wrote:
> MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a concern about
> holding I_SYNC for too long.  This shouldn't be a concern any more now
> that I_LOCK and I_SYNC have been separated.  So make it a tunable and
> change the default to 32768.
> 
> This change is helpful for ext4 since it means we write out large
> files in bigger chunks than just 4 megabytes (1024 pages of 4k each)
> at a time, so that when we have multiple large files in the page cache
> waiting for writeback, the files don't end up getting interleaved.
> There shouldn't be any downside.
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=13930

The current writeback sizes are definitely too small; we shoved a
hack into XFS to bump up nr_to_write to four times the value the
VM sends us to be able to saturate medium-sized RAID arrays in XFS.

Turns out this was not enough, and at least for Chris Mason's array
we only started saturating at * 16.  I suspect your patch will give
a similar effect.

>  /*
> + * The maximum number of pages to writeout in a single bdflush/kupdate
> + * operation.  We used to limit this to 1024 pages to avoid holding
> + * I_SYNC against an inode for a long period of time, but since
> + * I_SYNC has been separated out from I_LOCK, the only time a process
> + * waits for I_SYNC is when it is calling fsync() or otherwise forcing
> + * out the inode.
> + */
> +unsigned int max_writeback_pages = 32768;

Now while I'm sure this is a much, much better default than the
brain-dead previous one, I suspect we really need to scale it based
on the amount of available or dirty RAM to keep larger systems busy.
Also, that I_SYNC comment doesn't make too much sense to me - if we
do a sync/fsync we do need to write out all the data anyway, so
waiting for bdflush/kupdate is not a problem.  The only place where
it could matter is Jan's new O_SYNC implementation.
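
Purely as an illustrative sketch of that scaling idea (the divisor is
arbitrary and totalram_pages is just one possible input, so treat this
as hypothetical, not a proposal):

	/*
	 * Hypothetical auto-scaling, not part of the patch: derive the
	 * default from total RAM, e.g. 1/64th of all pages, but never
	 * less than the old hard-coded 1024.
	 */
	max_writeback_pages = max_t(unsigned int, 1024, totalram_pages / 64);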

And btw, I think referring to the historic code in the comment is not
a good idea; it's just going to confuse the heck out of everyone looking
at it in the future.  That information makes sense for the commit
message.

And the other big question is how this interacts with Jens' new per-bdi
flushing code that we still hope to merge in 2.6.32.

Maybe we'll actually get some sane writeback code for the first time.

> +
> +/*
>   * Start background writeback (via pdflush) at this percentage
>   */
>  int dirty_background_ratio = 10;
> @@ -708,10 +709,10 @@ static void background_writeout(unsigned long _min_pages)
>  			break;
>  		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
> -		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> +		wbc.nr_to_write = max_writeback_pages;
>  		wbc.pages_skipped = 0;
>  		writeback_inodes(&wbc);
> -		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> +		min_pages -= max_writeback_pages - wbc.nr_to_write;
>  		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>  			/* Wrote less than expected */
>  			if (wbc.encountered_congestion || wbc.more_io)
> @@ -783,7 +784,7 @@ static void wb_kupdate(unsigned long arg)
>  	while (nr_to_write > 0) {
>  		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
> -		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> +		wbc.nr_to_write = max_writeback_pages;
>  		writeback_inodes(&wbc);
>  		if (wbc.nr_to_write > 0) {
>  			if (wbc.encountered_congestion || wbc.more_io)
> @@ -791,7 +792,7 @@ static void wb_kupdate(unsigned long arg)
>  			else
>  				break;	/* All the old data is written */
>  		}
> -		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> +		nr_to_write -= max_writeback_pages - wbc.nr_to_write;
>  	}
>  	if (time_before(next_jif, jiffies + HZ))
>  		next_jif = jiffies + HZ;
> -- 
> 1.6.3.2.1.gb9f7d.dirty
> 
---end quoted text---


* Re: [PATCH, RFC] vm: Add a tuning knob for vm.max_writeback_pages
From: Theodore Tso @ 2009-08-30 18:17 UTC
  To: Christoph Hellwig
  Cc: linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason, jens.axboe

On Sun, Aug 30, 2009 at 12:52:29PM -0400, Christoph Hellwig wrote:
> On Sat, Aug 29, 2009 at 10:54:18PM -0400, Theodore Ts'o wrote:
> > MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a concern about
> > holding I_SYNC for too long.  This shouldn't be a concern any more now
> > that I_LOCK and I_SYNC have been separated.  So make it a tunable and
> > change the default to 32768.
> > 
> > This change is helpful for ext4 since it means we write out large
> > files in bigger chunks than just 4 megabytes (1024 pages of 4k each)
> > at a time, so that when we have multiple large files in the page cache
> > waiting for writeback, the files don't end up getting interleaved.
> > There shouldn't be any downside.
> > 
> > http://bugzilla.kernel.org/show_bug.cgi?id=13930
> 
> The current writeback sizes are definitely too small; we shoved a
> hack into XFS to bump up nr_to_write to four times the value the
> VM sends us to be able to saturate medium-sized RAID arrays in XFS.

Hmm, should we make it a per-superblock tunable so that it can
either be tuned on a per-block-device basis or adjusted by the
filesystem code to its liking?  I thought about it, but decided it
was better to keep it simple.

> Turns out this was not enough, and at least for Chris Mason's array
> we only started saturating at * 16.  I suspect your patch will give
> a similar effect.

So you think 16384 would be a better default?  The reason I picked
32768 was that it is the size of an ext4 block group, but otherwise
it was totally arbitrary.  I haven't done any benchmarking yet, which
is one of the reasons why I thought about making it a tunable.

> And btw, I think referring to the historic code in the comment is not
> a good idea; it's just going to confuse the heck out of everyone looking
> at it in the future.  That information makes sense for the commit
> message.

Yeah, good point.

> And the other big question is how this interacts with Jens' new per-bdi
> flushing code that we still hope to merge in 2.6.32.

Jens?  What do you think?  Fixing MAX_WRITEBACK_PAGES was something I
really wanted to merge in 2.6.32, since it makes a huge difference for
the block allocation layout of a "rsync -avH /old-fs /new-fs" when we
are copying a bunch of large files (say, 800 meg ISO images); the fact
that the writeback routine is writing out 4 megs at a time means that
our files get horribly interleaved and thus fragmented.

I initially thought about adding some massive workarounds in the
filesystem layer (which is I guess what XFS did), but I ultimately
decided this was begging to be solved in the page writeback code,
especially since it's *such* an easy fix.

> Maybe we'll actually get some sane writeback code for the first time.

To quote from "Fiddler on the Roof", from your lips to God's ears....

:-)

   	      	       	      	     	  - Ted


* Re: [PATCH, RFC] vm: Add a tuning knob for vm.max_writeback_pages
From: Christoph Hellwig @ 2009-08-30 22:27 UTC
  To: Theodore Tso
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	chris.mason, jens.axboe

On Sun, Aug 30, 2009 at 02:17:31PM -0400, Theodore Tso wrote:
> > 
> > The current writeback sizes are definitely too small; we shoved a
> > hack into XFS to bump up nr_to_write to four times the value the
> > VM sends us to be able to saturate medium-sized RAID arrays in XFS.
> 
> Hmm, should we make it a per-superblock tunable so that it can
> either be tuned on a per-block-device basis or adjusted by the
> filesystem code to its liking?  I thought about it, but decided it
> was better to keep it simple.

I don't think tuning it on a per-filesystem basis is a good idea;
we had to resort to this for 2.6.30 as a quick hack, and we will get
rid of it again in 2.6.31 one way or another.  I personally think we
should fight this cancer of per-filesystem hacks in the writeback code
as much as we can.  Right now people keep adding tuning hacks for
specific workloads there, and at least all the modern filesystems (ext4,
btrfs and XFS) have very similar requirements of the writeback code,
that is: give the filesystem as much as possible to write at the same
time, so it can make intelligent decisions based on that.  The VM
writeback code fails horribly at that right now.

> > Turns out this was not enough, and at least for Chris Mason's array
> > we only started saturating at * 16.  I suspect your patch will give
> > a similar effect.
> 
> So you think 16384 would be a better default?  The reason I picked
> 32768 was that it is the size of an ext4 block group, but otherwise
> it was totally arbitrary.  I haven't done any benchmarking yet, which
> is one of the reasons why I thought about making it a tunable.

It was just another arbitrary number.  My suspicion is that the exact
number does not matter; it just needs to be much, much larger than the
current one.  And the latency concerns are also overrated, as the block
layer will tell us that we are congested if we push too much into a
queue.
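
(For reference, that backoff is the congestion_wait() check in the
loops this patch touches; roughly, from the 2.6.30-era
background_writeout():

	if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
		/* wrote less than expected; back off if the queue is busy */
		if (wbc.encountered_congestion || wbc.more_io)
			congestion_wait(WRITE, HZ/10);
		else
			break;
	}

so a large nr_to_write degrades into short waits rather than unbounded
queueing.)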

> > And the other big question is how this interacts with Jens' new per-bdi
> > flushing code that we still hope to merge in 2.6.32.
> 
> Jens?  What do you think?  Fixing MAX_WRITEBACK_PAGES was something I
> really wanted to merge in 2.6.32, since it makes a huge difference for
> the block allocation layout of a "rsync -avH /old-fs /new-fs" when we
> are copying a bunch of large files (say, 800 meg ISO images); the fact
> that the writeback routine is writing out 4 megs at a time means that
> our files get horribly interleaved and thus fragmented.
> 
> I initially thought about adding some massive workarounds in the
> filesystem layer (which is I guess what XFS did),

XFS is relatively good at doing the disk block layout even with smaller
writeouts, so we do not have that fragmentation hack.  And the big
for-2.6.30 hack is a one-liner:

--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1268,6 +1268,14 @@ xfs_vm_writepage(
 	if (!page_has_buffers(page))
 		create_empty_buffers(page, 1 << inode->i_blkbits, 0);
 
+
+	/*
+	 *  VM calculation for nr_to_write seems off.  Bump it way
+	 *  up, this gets simple streaming writes zippy again.
+	 *  To be reviewed again after Jens' writeback changes.
+	 */
+	wbc->nr_to_write *= 4;
+
 	/*
 	 * Convert delayed allocate, unwritten or unmapped space
 	 * to real space and flush out to disk.

This also went past -fsdevel, but I didn't manage to get the flames for
adding these hacks that I hoped to get ;-)

My stance is to wait for this until about -rc2, at which point Jens'
code is hopefully in and we can start doing all the fine-tuning,
including lots of benchmarking.

Btw, one thing I would really like to see from one of the big companies
or the LF is running benchmarks like yours above, or just a simple one-
or two-stream dd, on some big machines weekly, so we can immediately see
regressions once someone starts to tweak the VM again.  Preferably
including seekwatcher data.


* Re: [PATCH, RFC] vm: Add a tuning knob for vm.max_writeback_pages
From: Theodore Tso @ 2009-08-31  3:08 UTC
  To: Christoph Hellwig
  Cc: linux-mm, Ext4 Developers List, linux-fsdevel, chris.mason, jens.axboe

On Sun, Aug 30, 2009 at 06:27:10PM -0400, Christoph Hellwig wrote:
> I don't think tuning it on a per-filesystem basis is a good idea;
> we had to resort to this for 2.6.30 as a quick hack, and we will get
> rid of it again in 2.6.31 one way or another.  I personally think we
> should fight this cancer of per-filesystem hacks in the writeback code
> as much as we can.  Right now people keep adding tuning hacks for
> specific workloads there, and at least all the modern filesystems (ext4,
> btrfs and XFS) have very similar requirements of the writeback code,
> that is: give the filesystem as much as possible to write at the same
> time, so it can make intelligent decisions based on that.  The VM
> writeback code fails horribly at that right now.

Yep; and Jens' patch doesn't change that.  It is still sending writes
out to the filesystem a piddling 1024 pages at a time.

> My stance is to wait for this until about -rc2, at which point Jens'
> code is hopefully in and we can start doing all the fine-tuning,
> including lots of benchmarking.

Well, I've ported my patch so it applies on top of Jens' per-bdi
patch.  It seems to be clearly needed; Jens, would you agree to add it
to your per-bdi patch series?  We can choose a different default if
you like, but making MAX_WRITEBACK_PAGES tunable seems to be clearly
necessary.

By the way, while I was testing my patch on top of v13 of the per-bdi
patches, I found something *very* curious.  I did a test where I ran the
following commands on a freshly mkfs'ed ext4 filesystem:

	dd if=/dev/zero of=test1 bs=1024k count=128
	dd if=/dev/zero of=test2 bs=1024k count=128
	sync

I traced the calls to ext4_da_writepages() using ftrace, and found this:

      flush-8:16-1829  [001]    23.416351: ext4_da_writepages: dev sdb ino 12 nr_t_write 32759 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
      flush-8:16-1829  [000]    25.939354: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
      flush-8:16-1829  [000]    25.939486: ext4_da_writepages: dev sdb ino 13 nr_t_write 32759 pages_skipped 0 range_start 134180864 range_end 9223372036854775807 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
      flush-8:16-1829  [000]    27.055687: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
      flush-8:16-1829  [000]    27.055691: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
      flush-8:16-1829  [000]    27.878708: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1

The *first* time the per-bdi code called writepages on the second file
(test2, inode #13), range_start was 134180864 (which, curiously
enough, is 4096*32759; 32759 was the value of nr_to_write passed to
ext4_da_writepages).  Given that the inode only had 32768 pages, the
fact that apparently *some* codepath called ext4_da_writepages
starting at logical block 32759, with nr_to_write set to 32759, seems
very curious indeed.  That doesn't seem right at all.  It's late, so I
won't try to trace it down now; plus, it's your code, so I figure
you can probably figure it out faster....

     	     	       	    	    	       	  - Ted

From 763fb19b3d14d73e71f5d88bd3b5f56869536ae5 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Sun, 30 Aug 2009 23:06:24 -0400
Subject: [PATCH] vm: Add a tuning knob for vm.max_writeback_pages

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern about holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.

According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped up nr_to_write to four times the value
sent by the VM to be able to saturate medium-sized RAID arrays.  This
value was problematic for ext4 as well, as it caused large files to
become interleaved on disk in 8 megabyte chunks (we had bumped up
nr_to_write by a factor of two).

So, in this patch, we make MAX_WRITEBACK_PAGES a tunable, and
change the default to be 32768 pages.

http://bugzilla.kernel.org/show_bug.cgi?id=13930

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
This patch is designed to be applied on top of the per-bdi v13 patchset.

 fs/fs-writeback.c         |   15 +++------------
 include/linux/writeback.h |    1 +
 kernel/sysctl.c           |    8 ++++++++
 mm/page-writeback.c       |    6 ++++++
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dfb4767..7e66de7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -264,15 +264,6 @@ void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
 	}
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
@@ -325,11 +316,11 @@ static long wb_writeback(struct bdi_writeback *wb, long nr_pages,
 
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = max_writeback_pages;
 		wbc.pages_skipped = 0;
 		generic_sync_wb_inodes(wb, sb, &wbc);
-		nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		nr_pages -= max_writeback_pages - wbc.nr_to_write;
+		wrote += max_writeback_pages - wbc.nr_to_write;
 		/*
 		 * If we ran out of stuff to write, bail unless more_io got set
 		 */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e070b91..a27fc00 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -100,6 +100,7 @@ extern int vm_dirty_ratio;
 extern unsigned long vm_dirty_bytes;
 extern unsigned int dirty_writeback_interval;
 extern unsigned int dirty_expire_interval;
+extern unsigned int max_writeback_pages;
 extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58be760..06d1c4c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_writeback_pages",
+		.data		= &max_writeback_pages,
+		.maxlen		= sizeof(max_writeback_pages),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= VM_NR_PDFLUSH_THREADS,
 		.procname	= "nr_pdflush_threads",
 		.data		= &nr_pdflush_threads,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a727af9..46fe54f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -55,6 +55,12 @@ static inline long sync_writeback_pages(void)
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
+ * The maximum number of pages to write out in a single bdflush/kupdate
+ * operation.
+ */
+unsigned int max_writeback_pages = 32768;
+
+/*
  * Start background writeback (via pdflush) at this percentage
  */
 int dirty_background_ratio = 10;
-- 
1.6.3.2.1.gb9f7d.dirty



* Re: [PATCH, RFC] vm: Add a tuning knob for vm.max_writeback_pages
From: Jens Axboe @ 2009-08-31 10:29 UTC
  To: Theodore Tso
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	chris.mason

On Sun, Aug 30 2009, Theodore Tso wrote:
> On Sun, Aug 30, 2009 at 06:27:10PM -0400, Christoph Hellwig wrote:
> > I don't think tuning it on a per-filesystem basis is a good idea;
> > we had to resort to this for 2.6.30 as a quick hack, and we will get
> > rid of it again in 2.6.31 one way or another.  I personally think we
> > should fight this cancer of per-filesystem hacks in the writeback code
> > as much as we can.  Right now people keep adding tuning hacks for
> > specific workloads there, and at least all the modern filesystems (ext4,
> > btrfs and XFS) have very similar requirements of the writeback code,
> > that is: give the filesystem as much as possible to write at the same
> > time, so it can make intelligent decisions based on that.  The VM
> > writeback code fails horribly at that right now.
> 
> Yep; and Jens' patch doesn't change that.  It is still sending writes
> out to the filesystem a piddling 1024 pages at a time.

Right, I didn't want to change too much of the logic; tuning is better
left to follow-up patches.  There is one change, though - where the
current logic splits ->nr_to_write between devices, with the writeback
patches each device gets the "full" MAX_WRITEBACK_PAGES.

> > My stance is to wait for this until about -rc2, at which point Jens'
> > code is hopefully in and we can start doing all the fine-tuning,
> > including lots of benchmarking.
> 
> Well, I've ported my patch so it applies on top of Jens' per-bdi
> patch.  It seems to be clearly needed; Jens, would you agree to add it
> to your per-bdi patch series?  We can choose a different default if
> you like, but making MAX_WRITEBACK_PAGES tunable seems to be clearly
> necessary.

I don't mind adding it, but do we really want to export the value? If we
plan on making this dynamically adaptable soon, then we'll be stuck with
some proc file that doesn't really do much. I guess by then we can leave
it as a 'maximum ever' type control, which at least would do
something...

> By the way, while I was testing my patch on top of v13 of the per-bdi
> patches, I found something *very* curious.  I did a test where I ran the
> following commands on a freshly mkfs'ed ext4 filesystem:
> 
> 	dd if=/dev/zero of=test1 bs=1024k count=128
> 	dd if=/dev/zero of=test2 bs=1024k count=128
> 	sync
> 
> I traced the calls to ext4_da_writepages() using ftrace, and found this:
> 
>       flush-8:16-1829  [001]    23.416351: ext4_da_writepages: dev sdb ino 12 nr_t_write 32759 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>       flush-8:16-1829  [000]    25.939354: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>       flush-8:16-1829  [000]    25.939486: ext4_da_writepages: dev sdb ino 13 nr_t_write 32759 pages_skipped 0 range_start 134180864 range_end 9223372036854775807 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>       flush-8:16-1829  [000]    27.055687: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>       flush-8:16-1829  [000]    27.055691: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
>       flush-8:16-1829  [000]    27.878708: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
> 
> The *first* time the per-bdi code called writepages on the second file
> (test2, inode #13), range_start was 134180864 (which, curiously
> enough, is 4096*32759; 32759 was the value of nr_to_write passed to
> ext4_da_writepages).  Given that the inode only had 32768 pages, the
> fact that apparently *some* codepath called ext4_da_writepages
> starting at logical block 32759, with nr_to_write set to 32759, seems
> very curious indeed.  That doesn't seem right at all.  It's late, so I
> won't try to trace it down now; plus, it's your code, so I figure
> you can probably figure it out faster....

Interesting, needs checking up on.  I've prepared a v14 patchset today;
perhaps (if you have time) you can see if it reproduces there?  I'm
running some performance tests today, but will make a note to look into
this after that.

-- 
Jens Axboe



* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-08-31 10:29           ` Jens Axboe
@ 2009-08-31 10:47             ` Jens Axboe
  -1 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2009-08-31 10:47 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	chris.mason

On Mon, Aug 31 2009, Jens Axboe wrote:
> > I traced the calls to ext4_da_writepages() using ftrace, and found this:
> > 
> >       flush-8:16-1829  [001]    23.416351: ext4_da_writepages: dev sdb ino 12 nr_t_write 32759 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
> >       flush-8:16-1829  [000]    25.939354: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
> >       flush-8:16-1829  [000]    25.939486: ext4_da_writepages: dev sdb ino 13 nr_t_write 32759 pages_skipped 0 range_start 134180864 range_end 9223372036854775807 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
> >       flush-8:16-1829  [000]    27.055687: ext4_da_writepages: dev sdb ino 12 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
> >       flush-8:16-1829  [000]    27.055691: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
> >       flush-8:16-1829  [000]    27.878708: ext4_da_writepages: dev sdb ino 13 nr_t_write 32768 pages_skipped 0 range_start 0 range_end 0 nonblocking 0 for_kupdate 0 for_reclaim 0 for_writepages 1 range_cyclic 1
> > 
> > The *first* time the per-bdi code called writepages on the second file
> > (test2, inode #13), range_start was 134180864 (which, curiously
> > enough, is 4096*32759, which was the value of nr_to_write passed to
> > ext4_da_writepages).  Given that the inode only had 32768 pages, the
> > fact that apparently *some* codepath called ext4_da_writepages
> > starting at logical block 32759, with nr_to_write set to 32759, seems
> > very curious indeed.  That doesn't seem right at all.  It's late, so I
> > won't try to trace it down now; plus which it's your code so I figure
> > you can probably figure it out faster....
> 
> Interesting, needs checking up on. I've prepared a v14 patchset today,
> perhaps (if you have time), you can see if it reproduces there? I'm
> running some performance tests today, but will make a note to look into
> this after that.

It's because ext4 writepages sets ->range_start and wb_writeback() is
range cyclic, so the next iteration will have the previous end point
as the starting point. Looks like we need to clear ->range_start in
wb_writeback(); the better place is probably to do that in
fs/fs-writeback.c:generic_sync_wb_inodes() right after the
writeback_single_inode() call. This, btw, should be no different than
the current code, weird/correct or not :-)
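
Something like this completely untested sketch is what I mean (the
surrounding loop in generic_sync_wb_inodes() is paraphrased from
memory, the clearing of ->range_start is the only point):

	/*
	 * fs/fs-writeback.c:generic_sync_wb_inodes(), in the
	 * per-inode writeback loop.
	 */
	writeback_single_inode(inode, wbc);
	/*
	 * Clear the range so a range_cyclic walk doesn't start the
	 * next inode at the point where the previous inode ended.
	 */
	wbc->range_start = 0;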

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-08-31 10:47             ` Jens Axboe
@ 2009-08-31 12:37               ` Theodore Tso
  -1 siblings, 0 replies; 24+ messages in thread
From: Theodore Tso @ 2009-08-31 12:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	chris.mason

On Mon, Aug 31, 2009 at 12:47:49PM +0200, Jens Axboe wrote:
> It's because ext4 writepages sets ->range_start and wb_writeback() is
> range cyclic, so the next iteration will have the previous end point
> as the starting point. Looks like we need to clear ->range_start in
> wb_writeback(); the better place is probably to do that in
> fs/fs-writeback.c:generic_sync_wb_inodes() right after the
> writeback_single_inode() call. This, btw, should be no different than
> the current code, weird/correct or not :-)

Hmm, or we could have ext4_da_writepages save and restore
->range_start.  One of the things that's never been well documented is
exactly what the semantics are of the various fields in the wbc
struct, and who is allowed to modify which fields when.

If you have some time, it would be great if you could document the
rules filesystems should be following with respect to the wbc struct,
and then we can audit each filesystem to make sure they follow those
rules.  One of the things which is a bit scary about how the many wbc
flags work is that each time a filesystem wants some particular
behavior, it seems like we need to dive into writeback code, and
figure out some combination of flags/settings that make the page
writeback code do what we want, and sometimes it's not clear whether
that was a designed-in semantic of the interface, or just something
that happened to work given the current implementation.

In any case, if one of the rules is that a filesystem's writepages
function shouldn't be modifying range_start, we can fix this problem
up by saving and restoring range_start inside ext4_da_writepages().

							- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-08-31 10:47             ` Jens Axboe
  (?)
  (?)
@ 2009-08-31 15:54             ` Theodore Tso
  2009-08-31 20:36               ` Jens Axboe
  -1 siblings, 1 reply; 24+ messages in thread
From: Theodore Tso @ 2009-08-31 15:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	chris.mason

At the risk of asking a stupid question, what *is* range_cyclic and
what is it trying to do?  I've been looking at the code and I'm
getting myself very confused about what the code is trying to do and
what its original intent was.

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-08-31 15:54             ` Theodore Tso
@ 2009-08-31 20:36               ` Jens Axboe
  0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2009-08-31 20:36 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	chris.mason

On Mon, Aug 31 2009, Theodore Tso wrote:
> At the risk of asking a stupid question, what *is* range_cyclic and
> what is it trying to do?  I've been looking at the code and I'm
> getting myself very confused about what the code is trying to do and
> what its original intent was.

Range cyclic means that the current writeback in ->writepages() should
start where it left off the last time; non-range-cyclic writeback
starts off at ->range_start. ->writepages() can use
mapping->writeback_index to store such information.
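
In write_cache_pages() terms it looks roughly like this (a
paraphrased sketch, not the literal mainline code):

	pgoff_t index;

	if (wbc->range_cyclic) {
		/* pick up where the last writeback pass stopped */
		index = mapping->writeback_index;
	} else {
		/* honor the caller-supplied range */
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
	}

	/* ... write out dirty pages starting at index ... */

	if (wbc->range_cyclic)
		/* remember the stopping point for the next pass */
		mapping->writeback_index = index;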

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-08-31 10:47             ` Jens Axboe
                               ` (2 preceding siblings ...)
  (?)
@ 2009-08-31 21:03             ` Theodore Tso
  2009-09-01  7:57               ` Aneesh Kumar K.V
  2009-09-01  9:17               ` Jens Axboe
  -1 siblings, 2 replies; 24+ messages in thread
From: Theodore Tso @ 2009-08-31 21:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	chris.mason

On Mon, Aug 31, 2009 at 12:47:49PM +0200, Jens Axboe wrote:
> It's because ext4 writepages sets ->range_start and wb_writeback() is
> range cyclic, so the next iteration will have the previous end point
> as the starting point. Looks like we need to clear ->range_start in
> wb_writeback(); the better place is probably to do that in
> fs/fs-writeback.c:generic_sync_wb_inodes() right after the
> writeback_single_inode() call. This, btw, should be no different than
> the current code, weird/correct or not :-)

Thanks for pointing it out.  After staring at the code, I now believe
this is the best fix for now.  What do other folks think?

     	    	     	       	       - Ted

commit 39cac8147479b48cd45b768d184aa6a80f23a2f7
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Mon Aug 31 17:00:59 2009 -0400

    ext4: Restore wbc->range_start in ext4_da_writepages()
    
    To solve a lock inversion problem, we implement part of the
    range_cyclic algorithm in ext4_da_writepages().  (See commit 2acf2c26
    for more details.)
    
    As part of that change wbc->range_start was modified by ext4's
    writepages function, which causes its callers to get confused since
    they aren't expecting the filesystem to modify it.  The simplest fix
    is to save and restore wbc->range_start in ext4_da_writepages.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d61fb52..ff659e7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2749,6 +2749,7 @@ static int ext4_da_writepages(struct address_space *mapping,
 	long pages_skipped;
 	int range_cyclic, cycled = 1, io_done = 0;
 	int needed_blocks, ret = 0, nr_to_writebump = 0;
+	loff_t range_start = wbc->range_start;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
 
 	trace_ext4_da_writepages(inode, wbc);
@@ -2917,6 +2918,7 @@ out_writepages:
 	if (!no_nrwrite_index_update)
 		wbc->no_nrwrite_index_update = 0;
 	wbc->nr_to_write -= nr_to_writebump;
+	wbc->range_start = range_start;
 	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-08-31 21:03             ` Theodore Tso
@ 2009-09-01  7:57               ` Aneesh Kumar K.V
  2009-09-01  9:17               ` Jens Axboe
  1 sibling, 0 replies; 24+ messages in thread
From: Aneesh Kumar K.V @ 2009-09-01  7:57 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jens Axboe, Christoph Hellwig, linux-mm, Ext4 Developers List,
	linux-fsdevel, chris.mason

On Mon, Aug 31, 2009 at 05:03:37PM -0400, Theodore Tso wrote:
> On Mon, Aug 31, 2009 at 12:47:49PM +0200, Jens Axboe wrote:
> > It's because ext4 writepages sets ->range_start and wb_writeback() is
> > range cyclic, so the next iteration will have the previous end point
> > as the starting point. Looks like we need to clear ->range_start in
> > wb_writeback(); the better place is probably to do that in
> > fs/fs-writeback.c:generic_sync_wb_inodes() right after the
> > writeback_single_inode() call. This, btw, should be no different than
> > the current code, weird/correct or not :-)
> 
> Thanks for pointing it out.  After staring at the code, I now believe
> this is the best fix for now.  What do other folks think?
> 
>      	    	     	       	       - Ted
> 
> commit 39cac8147479b48cd45b768d184aa6a80f23a2f7
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Mon Aug 31 17:00:59 2009 -0400
> 
>     ext4: Restore wbc->range_start in ext4_da_writepages()
>     
>     To solve a lock inversion problem, we implement part of the
>     range_cyclic algorithm in ext4_da_writepages().  (See commit 2acf2c26
>     for more details.)
>     
>     As part of that change wbc->range_start was modified by ext4's
>     writepages function, which causes its callers to get confused since
>     they aren't expecting the filesystem to modify it.  The simplest fix
>     is to save and restore wbc->range_start in ext4_da_writepages.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d61fb52..ff659e7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2749,6 +2749,7 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	long pages_skipped;
>  	int range_cyclic, cycled = 1, io_done = 0;
>  	int needed_blocks, ret = 0, nr_to_writebump = 0;
> +	loff_t range_start = wbc->range_start;
>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
> 
>  	trace_ext4_da_writepages(inode, wbc);
> @@ -2917,6 +2918,7 @@ out_writepages:
>  	if (!no_nrwrite_index_update)
>  		wbc->no_nrwrite_index_update = 0;
>  	wbc->nr_to_write -= nr_to_writebump;
> +	wbc->range_start = range_start;
>  	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
>  	return ret;
>  }

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

We used to reset range_start, up until commit af6f029d3836eb7264cd3fbb13a6baf0e5fdb5ea.

-aneesh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-08-31 21:03             ` Theodore Tso
  2009-09-01  7:57               ` Aneesh Kumar K.V
@ 2009-09-01  9:17               ` Jens Axboe
  1 sibling, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2009-09-01  9:17 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	chris.mason

On Mon, Aug 31 2009, Theodore Tso wrote:
> On Mon, Aug 31, 2009 at 12:47:49PM +0200, Jens Axboe wrote:
> > It's because ext4 writepages sets ->range_start and wb_writeback() is
> > range cyclic, so the next iteration will have the previous end point
> > as the starting point. Looks like we need to clear ->range_start in
> > wb_writeback(); the better place is probably to do that in
> > fs/fs-writeback.c:generic_sync_wb_inodes() right after the
> > writeback_single_inode() call. This, btw, should be no different than
> > the current code, weird/correct or not :-)
> 
> Thanks for pointing it out.  After staring at the code, I now believe
> this is the best fix for now.  What do other folks think?
> 
>      	    	     	       	       - Ted
> 
> commit 39cac8147479b48cd45b768d184aa6a80f23a2f7
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Mon Aug 31 17:00:59 2009 -0400
> 
>     ext4: Restore wbc->range_start in ext4_da_writepages()
>     
>     To solve a lock inversion problem, we implement part of the
>     range_cyclic algorithm in ext4_da_writepages().  (See commit 2acf2c26
>     for more details.)
>     
>     As part of that change wbc->range_start was modified by ext4's
>     writepages function, which causes its callers to get confused since
>     they aren't expecting the filesystem to modify it.  The simplest fix
>     is to save and restore wbc->range_start in ext4_da_writepages.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d61fb52..ff659e7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2749,6 +2749,7 @@ static int ext4_da_writepages(struct address_space *mapping,
>  	long pages_skipped;
>  	int range_cyclic, cycled = 1, io_done = 0;
>  	int needed_blocks, ret = 0, nr_to_writebump = 0;
> +	loff_t range_start = wbc->range_start;
>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>  
>  	trace_ext4_da_writepages(inode, wbc);
> @@ -2917,6 +2918,7 @@ out_writepages:
>  	if (!no_nrwrite_index_update)
>  		wbc->no_nrwrite_index_update = 0;
>  	wbc->nr_to_write -= nr_to_writebump;
> +	wbc->range_start = range_start;
>  	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
>  	return ret;
>  }

I was going to suggest using range_start locally and not touching
->range_start, but I see you pass the wbc further down. So this looks
fine to me.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-08-30 18:17     ` Theodore Tso
  (?)
  (?)
@ 2009-09-01 18:00     ` Chris Mason
  2009-09-01 20:30         ` Theodore Tso
  -1 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2009-09-01 18:00 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Christoph Hellwig, linux-mm, Ext4 Developers List, linux-fsdevel,
	jens.axboe

On Sun, Aug 30, 2009 at 02:17:31PM -0400, Theodore Tso wrote:
[ ... ]

> Jens?  What do you think?  Fixing MAX_WRITEBACK_PAGES was something I
> really wanted to merge in 2.6.32 since it makes a huge difference for
> the block allocation layout for a "rsync -avH /old-fs /new-fs" when we
> are copying bunch of large files (say, 800 meg iso images) and so the
> fact that the writeback routine is writing out 4 megs at a time, means
> that our files get horribly interleaved and thus get fragmented.

I did some runs comparing mainline with Jens' current writeback queue.
This is just btrfs, but it's a good example of how things are improving.

These graphs show us the 'compile' phase of compilebench, where it is
writing all the .o files into the 30 kernel trees.  Basically we have
one thread, creating a bunch of files based on the sizes of all the .o
files in a compiled kernel.  They are created in random order, similar
to the files produced from a make -j.

I haven't yet tried this without the max_writeback_pages patch, but the
graphs clearly show a speed improvement, and that the mainline code is
smearing writes across the drive while Jens' work is writing
sequentially.

Jens' writeback branch:
http://oss.oracle.com/~mason/seekwatcher/compilebench-writes-axboe.png

Mainline
http://oss.oracle.com/~mason/seekwatcher/compilebench-writes-pdflush.png

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages
  2009-09-01 18:00     ` Chris Mason
@ 2009-09-01 20:30         ` Theodore Tso
  0 siblings, 0 replies; 24+ messages in thread
From: Theodore Tso @ 2009-09-01 20:30 UTC (permalink / raw)
  To: Chris Mason, Christoph Hellwig, linux-mm, Ext4 Developers List,
	linux-fsdevel, jens.axboe

On Tue, Sep 01, 2009 at 02:00:52PM -0400, Chris Mason wrote:
> 
> I haven't yet tried this without the max_writeback_pages patch, but the
> graphs clearly show a speed improvement, and that the mainline code is
> smearing writes across the drive while Jens' work is writing
> sequentially.

FYI, you don't need to revert the max_writeback_pages patch; the
whole point of making it a tunable was to make it easier to run
benchmarks.  If you want to get the effects of the original setting of
MAX_WRITEBACK_PAGES before the patch, just run as root: 

    sysctl vm.max_writeback_pages=1024
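
(The current value can also be read back from
/proc/sys/vm/max_writeback_pages, assuming the knob lands under that
name.)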

						- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2009-08-30  2:54 [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages Theodore Ts'o
2009-08-30  2:54 ` Theodore Ts'o
2009-08-30 16:52 ` Christoph Hellwig
2009-08-30 16:52   ` Christoph Hellwig
2009-08-30 18:17   ` Theodore Tso
2009-08-30 18:17     ` Theodore Tso
2009-08-30 22:27     ` Christoph Hellwig
2009-08-30 22:27       ` Christoph Hellwig
2009-08-31  3:08       ` Theodore Tso
2009-08-31  3:08         ` Theodore Tso
2009-08-31 10:29         ` Jens Axboe
2009-08-31 10:29           ` Jens Axboe
2009-08-31 10:47           ` Jens Axboe
2009-08-31 10:47             ` Jens Axboe
2009-08-31 12:37             ` Theodore Tso
2009-08-31 12:37               ` Theodore Tso
2009-08-31 15:54             ` Theodore Tso
2009-08-31 20:36               ` Jens Axboe
2009-08-31 21:03             ` Theodore Tso
2009-09-01  7:57               ` Aneesh Kumar K.V
2009-09-01  9:17               ` Jens Axboe
2009-09-01 18:00     ` Chris Mason
2009-09-01 20:30       ` Theodore Tso
2009-09-01 20:30         ` Theodore Tso
