* [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate
From: Eric Sandeen @ 2010-03-29 15:29 UTC
To: ext4 development
(resend, email sent Friday seems lost)
There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.
At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amount; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.
Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.
However, talking with Aneesh he realized that fsync upstream still
gets here with a very large nr_to_write and we face the same problem.
This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.
Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch. Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.
It'd be better to not have a magic number of 2048 in here, so I'll
look for a cleaner way to get this info out of mballoc; I still need
to look at what Aneesh has in the patch queue, that might help.
This is something we could probably put in for now, though; the 2048
is already enshrined in a comment in inode.c, at least.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -2318,6 +2318,10 @@ static void mpage_add_bh_to_extent(struc
 	sector_t next;
 	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
 
+	/* Don't go larger than mballoc is willing to allocate */
+	if (nrblocks >= 2048)
+		goto flush_it;
+
 	/* check if the reserved journal credits might overflow */
 	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
 		if (nrblocks >= EXT4_MAX_TRANS_DATA) {
* Re: [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate
From: tytso @ 2010-04-05 13:11 UTC
To: Eric Sandeen; +Cc: ext4 development
On Mon, Mar 29, 2010 at 10:29:37AM -0500, Eric Sandeen wrote:
> This patch makes mpage_add_bh_to_extent stop the loop after we've
> accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
> causes the write_cache_pages loop to break.
>
> Repeating the test with a dirty_ratio of 80 (to leave something for
> fsync to do), I don't see huge IO performance gains, but the reduction
> in cpu usage is striking: 80% usage with stock, and 2% with the
> below patch. Instrumenting the loop in write_cache_pages clearly
> shows that we are wasting time here.
>
> It'd be better to not have a magic number of 2048 in here, so I'll
> look for a cleaner way to get this info out of mballoc; I still need
> to look at what Aneesh has in the patch queue, that might help.
> This is something we could probably put in for now, though; the 2048
> is already enshrined in a comment in inode.c, at least.
I wonder if a better way of fixing this is to change
mpage_da_map_pages() to call ext4_get_blocks() multiple times. This
should be a lot easier after we integrate mpage_da_submit_io() into
mpage_da_map_pages(). That way we can be way more efficient; in a loop,
we accumulate the pages, call ext4_get_blocks(), then submit the IO
(as a single block I/O submission, instead of 4k at a time through
ext4_writepages()), and then call ext4_get_blocks() again, etc.
I'm willing to include this patch as an interim stopgap, but
eventually, I think we need to refactor and reorganize
mpage_da_map_pages() and mpage_da_submit_IO(), and let them call
mballoc (via ext4_get_blocks) multiple times in a loop.
Thoughts, suggestions?
- Ted
* Re: [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate
From: Eric Sandeen @ 2010-04-05 14:42 UTC
To: tytso; +Cc: ext4 development
tytso@mit.edu wrote:
> On Mon, Mar 29, 2010 at 10:29:37AM -0500, Eric Sandeen wrote:
>> This patch makes mpage_add_bh_to_extent stop the loop after we've
>> accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
>> causes the write_cache_pages loop to break.
>>
>> Repeating the test with a dirty_ratio of 80 (to leave something for
>> fsync to do), I don't see huge IO performance gains, but the reduction
>> in cpu usage is striking: 80% usage with stock, and 2% with the
>> below patch. Instrumenting the loop in write_cache_pages clearly
>> shows that we are wasting time here.
>>
>> It'd be better to not have a magic number of 2048 in here, so I'll
>> look for a cleaner way to get this info out of mballoc; I still need
>> to look at what Aneesh has in the patch queue, that might help.
>> This is something we could probably put in for now, though; the 2048
>> is already enshrined in a comment in inode.c, at least.
>
> I wonder if a better way of fixing this is to change
> mpage_da_map_pages() to call ext4_get_blocks() multiple times. This
That sounds reasonable, I'll look into writing something up and testing
it a bit.
Up to you whether the initial patch goes in, I know it's kind of
stopgap/hacky.
thanks,
-Eric
> should be a lot easier after we integrate mpage_da_submit_io() into
> mpage_da_map_pages(). That way we can be way more efficient; in a loop,
> we accumulate the pages, call ext4_get_blocks(), then submit the IO
> (as a single block I/O submission, instead of 4k at a time through
> ext4_writepages()), and then call ext4_get_blocks() again, etc.
> I'm willing to include this patch as an interim stopgap, but
> eventually, I think we need to refactor and reorganize
> mpage_da_map_pages() and mpage_da_submit_IO(), and let them call
> mballoc (via ext4_get_blocks) multiple times in a loop.
>
> Thoughts, suggestions?
>
> - Ted
* [PATCH] ext4: don't scan/accumulate more pages than mballoc will allocate
From: Theodore Ts'o @ 2010-04-08 2:10 UTC
To: Ext4 Developers List; +Cc: Eric Sandeen, Theodore Ts'o
From: Eric Sandeen <sandeen@redhat.com>
There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.
At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amount; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.
Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.
However, talking with Aneesh he realized that fsync upstream still
gets here with a very large nr_to_write and we face the same problem.
This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.
Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch. Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.
Eventually we need to change mpage_da_map_pages() to also submit its
I/O to the block layer, subsuming mpage_da_submit_io(), and then to
call ext4_get_blocks() multiple times.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
This is the slightly revised version of Eric's patch that I've added to
the ext4 patch queue. -- Ted
fs/ext4/inode.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c6ca10..2c12926 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2349,6 +2349,15 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 	sector_t next;
 	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
 
+	/*
+	 * XXX Don't go larger than mballoc is willing to allocate
+	 * This is a stopgap solution.  We eventually need to fold
+	 * mpage_da_submit_io() into this function and then call
+	 * ext4_get_blocks() multiple times in a loop
+	 */
+	if (nrblocks >= 8*1024*1024/mpd->inode->i_sb->s_blocksize)
+		goto flush_it;
+
 	/* check if the reserved journal credits might overflow */
 	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
 		if (nrblocks >= EXT4_MAX_TRANS_DATA) {
--
1.6.6.1.1.g974db.dirty
* Re: [PATCH] ext4: don't scan/accumulate more pages than mballoc will allocate
From: Eric Sandeen @ 2010-04-08 2:31 UTC
To: Theodore Ts'o; +Cc: Ext4 Developers List
Theodore Ts'o wrote:
> From: Eric Sandeen <sandeen@redhat.com>
>
> There was a bug reported on RHEL5 that a 10G dd on a 12G box
> had a very, very slow sync after that.
>
> At issue was the loop in write_cache_pages scanning all the way
> to the end of the 10G file, even though the subsequent call
> to mpage_da_submit_io would only actually write a smallish amount; then
> we went back to the write_cache_pages loop ... wasting tons of time
> in calling __mpage_da_writepage for thousands of pages we would
> just revisit (many times) later.
>
> Upstream it's not such a big issue for sys_sync because we get
> to the loop with a much smaller nr_to_write, which limits the loop.
>
> However, talking with Aneesh he realized that fsync upstream still
> gets here with a very large nr_to_write and we face the same problem.
>
> This patch makes mpage_add_bh_to_extent stop the loop after we've
> accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
> causes the write_cache_pages loop to break.
>
> Repeating the test with a dirty_ratio of 80 (to leave something for
> fsync to do), I don't see huge IO performance gains, but the reduction
> in cpu usage is striking: 80% usage with stock, and 2% with the
> below patch. Instrumenting the loop in write_cache_pages clearly
> shows that we are wasting time here.
>
> Eventually we need to change mpage_da_map_pages() to also submit its
> I/O to the block layer, subsuming mpage_da_submit_io(), and then to
> call ext4_get_blocks() multiple times.
>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>
> This is the slightly revised version of Eric's patch that I've added to
> the ext4 patch queue. -- Ted
Seems fine, thanks.
-Eric
> fs/ext4/inode.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c6ca10..2c12926 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2349,6 +2349,15 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
>  	sector_t next;
>  	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
>  
> +	/*
> +	 * XXX Don't go larger than mballoc is willing to allocate
> +	 * This is a stopgap solution.  We eventually need to fold
> +	 * mpage_da_submit_io() into this function and then call
> +	 * ext4_get_blocks() multiple times in a loop
> +	 */
> +	if (nrblocks >= 8*1024*1024/mpd->inode->i_sb->s_blocksize)
> +		goto flush_it;
> +
>  	/* check if the reserved journal credits might overflow */
>  	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
>  		if (nrblocks >= EXT4_MAX_TRANS_DATA) {