* [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate
@ 2010-03-29 15:29 Eric Sandeen
  2010-04-05 13:11 ` tytso
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Sandeen @ 2010-03-29 15:29 UTC (permalink / raw)
  To: ext4 development

(resend, email sent Friday seems lost)

There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.

At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amt; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.

Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.

However, Aneesh pointed out that fsync upstream still gets here
with a very large nr_to_write, so we face the same problem.

This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.
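
To make the mechanism concrete, here is a minimal, self-contained
sketch of that control flow.  The struct and helper names below are
illustrative stand-ins rather than the real kernel symbols; only the
io_done flag corresponds to the actual field:

struct da_data {
	unsigned long nr_pages;	/* pages accumulated so far */
	int io_done;		/* set once the current extent is flushed */
};

#define MAX_ACCUMULATED_PAGES 2048	/* the limit discussed above */

static void add_page_to_extent(struct da_data *mpd)
{
	if (mpd->nr_pages >= MAX_ACCUMULATED_PAGES) {
		/* flush what we have so far and tell the caller to stop */
		mpd->io_done = 1;
		return;
	}
	mpd->nr_pages++;
}

/* stand-in for the per-page callback passed to write_cache_pages() */
static int da_writepage(struct da_data *mpd)
{
	if (mpd->io_done)
		return 1;	/* nonzero return breaks the scan loop */
	add_page_to_extent(mpd);
	return 0;
}

/* stand-in for the write_cache_pages() scan loop itself */
static void scan_dirty_pages(struct da_data *mpd, unsigned long nr_dirty)
{
	unsigned long i;

	for (i = 0; i < nr_dirty; i++)
		if (da_writepage(mpd))
			break;	/* stop early instead of walking to EOF */
}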

Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch.  Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.
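
For reference, the test boils down to roughly the following reproducer
(the path and sizes here are arbitrary, and vm.dirty_ratio is assumed
to have been raised beforehand via sysctl):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const size_t chunk = 1 << 20;			/* 1 MiB writes */
	const unsigned long long total = 10ULL << 30;	/* ~10G of data */
	char *buf = malloc(chunk);
	int fd = open("/mnt/test/bigfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
	struct timespec t0, t1;
	unsigned long long done;

	if (!buf || fd < 0)
		return 1;
	memset(buf, 0xab, chunk);

	for (done = 0; done < total; done += chunk)
		if (write(fd, buf, chunk) != (ssize_t)chunk)
			return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	fsync(fd);					/* the slow part */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("fsync took %.2f seconds\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	close(fd);
	free(buf);
	return 0;
}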

It'd be better to not have a magic number of 2048 in here, so I'll
look for a cleaner way to get this info out of mballoc; I still need
to look at what Aneesh has in the patch queue, that might help.
This is something we could probably put in for now, though; the 2048
is already enshrined in a comment in inode.c, at least.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -2318,6 +2318,10 @@ static void mpage_add_bh_to_extent(struc
 	sector_t next;
 	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
 
+	/* Don't go larger than mballoc is willing to allocate */
+	if (nrblocks >= 2048)
+		goto flush_it;
+
 	/* check if thereserved journal credits might overflow */
 	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
 		if (nrblocks >= EXT4_MAX_TRANS_DATA) {



* Re: [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate
  2010-03-29 15:29 [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate Eric Sandeen
@ 2010-04-05 13:11 ` tytso
  2010-04-05 14:42   ` Eric Sandeen
  0 siblings, 1 reply; 5+ messages in thread
From: tytso @ 2010-04-05 13:11 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: ext4 development

On Mon, Mar 29, 2010 at 10:29:37AM -0500, Eric Sandeen wrote:
> This patch makes mpage_add_bh_to_extent stop the loop after we've
> accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
> causes the write_cache_pages loop to break.
> 
> Repeating the test with a dirty_ratio of 80 (to leave something for
> fsync to do), I don't see huge IO performance gains, but the reduction
> in cpu usage is striking: 80% usage with stock, and 2% with the
> below patch.  Instrumenting the loop in write_cache_pages clearly
> shows that we are wasting time here.
> 
> It'd be better to not have a magic number of 2048 in here, so I'll
> look for a cleaner way to get this info out of mballoc; I still need
> to look at what Aneesh has in the patch queue, that might help.
> This is something we could probably put in for now, though; the 2048
> is already enshrined in a comment in inode.c, at least.

I wonder if a better way of fixing this is to change
mpage_da_map_pages() to call ext4_get_blocks() multiple times.  This
should be a lot easier after we integrate mpage_da_submit_io() into
mpage_da_map_pages().  That way we can be much more efficient; in a loop,
we accumulate the pages, call ext4_get_blocks(), then submit the IO
(as a single block I/O submission, instead of 4k at a time through
ext4_writepages()), and then call ext4_get_blocks() again, etc.

I'm willing to include this patch as an interim stopgap, but
eventually, I think we need to refactor and reorganize
mpage_da_map_pages() and mpage_da_submit_io(), and let them call
mballoc (via ext4_get_blocks) multiple times in a loop.
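
Roughly, the loop would look something like the following sketch
(simplified stand-in helpers, not the real ext4 signatures):

struct extent {
	unsigned long start;	/* first logical block of the extent */
	unsigned long len;	/* blocks still to be mapped and written */
};

/* stand-in for ext4_get_blocks(): map at most max_blocks contiguous blocks */
static unsigned long map_blocks(struct extent *ex, unsigned long max_blocks)
{
	return ex->len < max_blocks ? ex->len : max_blocks;
}

/* stand-in for building and submitting one contiguous block I/O */
static void submit_io(unsigned long start, unsigned long nr)
{
	/* one bio covering blocks [start, start + nr) */
}

/* the proposed shape of a combined map-and-submit function */
static void map_and_submit(struct extent *ex, unsigned long max_blocks)
{
	while (ex->len) {
		unsigned long nr = map_blocks(ex, max_blocks);

		submit_io(ex->start, nr);  /* one submission per mapped chunk */
		ex->start += nr;
		ex->len -= nr;
	}
}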

Thoughts, suggestions?

					- Ted


* Re: [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate
  2010-04-05 13:11 ` tytso
@ 2010-04-05 14:42   ` Eric Sandeen
  2010-04-08  2:10     ` [PATCH] ext4: " Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Sandeen @ 2010-04-05 14:42 UTC (permalink / raw)
  To: tytso; +Cc: ext4 development

tytso@mit.edu wrote:
> On Mon, Mar 29, 2010 at 10:29:37AM -0500, Eric Sandeen wrote:
>> This patch makes mpage_add_bh_to_extent stop the loop after we've
>> accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
>> causes the write_cache_pages loop to break.
>>
>> Repeating the test with a dirty_ratio of 80 (to leave something for
>> fsync to do), I don't see huge IO performance gains, but the reduction
>> in cpu usage is striking: 80% usage with stock, and 2% with the
>> below patch.  Instrumenting the loop in write_cache_pages clearly
>> shows that we are wasting time here.
>>
>> It'd be better to not have a magic number of 2048 in here, so I'll
>> look for a cleaner way to get this info out of mballoc; I still need
>> to look at what Aneesh has in the patch queue, that might help.
>> This is something we could probably put in for now, though; the 2048
>> is already enshrined in a comment in inode.c, at least.
> 
> I wonder if a better way of fixing this is to change
> mpage_da_map_pages() to call ext4_get_blocks() multiple times.  This

That sounds reasonable, I'll look into writing something up and testing
it a bit.

Up to you whether the initial patch goes in, I know it's kind of
stopgap/hacky.

thanks,
-Eric

> should be a lot easier after we integrate mpage_da_submit_io() into
> mpage_da_map_pages().  That way we can be much more efficient; in a loop,
> we accumulate the pages, call ext4_get_blocks(), then submit the IO
> (as a single block I/O submission, instead of 4k at a time through
> ext4_writepages()), and then call ext4_get_blocks() again, etc.



> I'm willing to include this patch as an interim stopgap, but
> eventually, I think we need to refactor and reorganize
> mpage_da_map_pages() and mpage_da_submit_io(), and let them call
> mballoc (via ext4_get_blocks) multiple times in a loop.
> 
> Thoughts, suggestions?
> 
> 					- Ted



* [PATCH] ext4: don't scan/accumulate more pages than mballoc will allocate
  2010-04-05 14:42   ` Eric Sandeen
@ 2010-04-08  2:10     ` Theodore Ts'o
  2010-04-08  2:31       ` Eric Sandeen
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2010-04-08  2:10 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Eric Sandeen, Theodore Ts'o

From: Eric Sandeen <sandeen@redhat.com>

There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.

At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amt; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.

Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.

However, Aneesh pointed out that fsync upstream still gets here
with a very large nr_to_write, so we face the same problem.

This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.

Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch.  Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.

Eventually we need to change mpage_da_map_pages() to also submit its
I/O to the block layer, subsuming mpage_da_submit_io(), and then change
it to call ext4_get_blocks() multiple times.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---

This is the slightly revised version of Eric's patch that I've added to
the ext4 patch queue. -- Ted

 fs/ext4/inode.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c6ca10..2c12926 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2349,6 +2349,15 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 	sector_t next;
 	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
 
+	/* 
+	 * XXX Don't go larger than mballoc is willing to allocate
+	 * This is a stopgap solution.  We eventually need to fold
+	 * mpage_da_submit_io() into this function and then call
+	 * ext4_get_blocks() multiple times in a loop
+	 */
+	if (nrblocks >= 8*1024*1024/mpd->inode->i_sb->s_blocksize)
+		goto flush_it;
+
 	/* check if thereserved journal credits might overflow */
 	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
 		if (nrblocks >= EXT4_MAX_TRANS_DATA) {
-- 
1.6.6.1.1.g974db.dirty



* Re: [PATCH] ext4: don't scan/accumulate more pages than mballoc will allocate
  2010-04-08  2:10     ` [PATCH] ext4: " Theodore Ts'o
@ 2010-04-08  2:31       ` Eric Sandeen
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Sandeen @ 2010-04-08  2:31 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

Theodore Ts'o wrote:
> From: Eric Sandeen <sandeen@redhat.com>
> 
> There was a bug reported on RHEL5 that a 10G dd on a 12G box
> had a very, very slow sync after that.
> 
> At issue was the loop in write_cache_pages scanning all the way
> to the end of the 10G file, even though the subsequent call
> to mpage_da_submit_io would only actually write a smallish amt; then
> we went back to the write_cache_pages loop ... wasting tons of time
> in calling __mpage_da_writepage for thousands of pages we would
> just revisit (many times) later.
> 
> Upstream it's not such a big issue for sys_sync because we get
> to the loop with a much smaller nr_to_write, which limits the loop.
> 
> However, Aneesh pointed out that fsync upstream still gets here
> with a very large nr_to_write, so we face the same problem.
> 
> This patch makes mpage_add_bh_to_extent stop the loop after we've
> accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
> causes the write_cache_pages loop to break.
> 
> Repeating the test with a dirty_ratio of 80 (to leave something for
> fsync to do), I don't see huge IO performance gains, but the reduction
> in cpu usage is striking: 80% usage with stock, and 2% with the
> below patch.  Instrumenting the loop in write_cache_pages clearly
> shows that we are wasting time here.
> 
> Eventually we need to change mpage_da_map_pages() to also submit its
> I/O to the block layer, subsuming mpage_da_submit_io(), and then change
> it to call ext4_get_blocks() multiple times.
> 
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
> 
> This is the slightly revised version of Eric's patch that I've added to
> the ext4 patch queue. -- Ted

Seems fine, thanks.

-Eric

>  fs/ext4/inode.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c6ca10..2c12926 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2349,6 +2349,15 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
>  	sector_t next;
>  	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
>  
> +	/* 
> +	 * XXX Don't go larger than mballoc is willing to allocate
> +	 * This is a stopgap solution.  We eventually need to fold
> +	 * mpage_da_submit_io() into this function and then call
> +	 * ext4_get_blocks() multiple times in a loop
> +	 */
> +	if (nrblocks >= 8*1024*1024/mpd->inode->i_sb->s_blocksize)
> +		goto flush_it;
> +
>  	/* check if thereserved journal credits might overflow */
>  	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
>  		if (nrblocks >= EXT4_MAX_TRANS_DATA) {



end of thread

Thread overview: 5+ messages
2010-03-29 15:29 [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate Eric Sandeen
2010-04-05 13:11 ` tytso
2010-04-05 14:42   ` Eric Sandeen
2010-04-08  2:10     ` [PATCH] ext4: " Theodore Ts'o
2010-04-08  2:31       ` Eric Sandeen
