linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Theodore Tso <tytso@mit.edu>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <jens.axboe@oracle.com>,
	Linux Kernel Developers List <linux-kernel@vger.kernel.org>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>
Subject: Re: [GIT PULL] Ext3 latency fixes
Date: Sat, 4 Apr 2009 09:57:19 -0400	[thread overview]
Message-ID: <20090404135719.GA9812@mit.edu> (raw)
In-Reply-To: <alpine.LFD.2.00.0904031329410.7007@localhost.localdomain>

On Fri, Apr 03, 2009 at 01:41:54PM -0700, Linus Torvalds wrote:
> 
> Hmm. So I decided to try with "data=writeback" to see if it really makes 
> that big of a difference. It does help, but I still easily trigger 
> multi-second pauses

Using ext3 data=writeback, your "write big (2GB) file and sync" as the
background workload, and fsync-tester, I was able to reduce the
latency down to under 1 second...

fsync time: 0.1717
fsync time: 0.0205
fsync time: 0.1814
fsync time: 0.7408
fsync time: 0.1955
fsync time: 0.0327
fsync time: 0.0456
fsync time: 0.0563

...by doing two things: (1) Applying the attached patch, which fixes
one last critical place where we were using WRITE instead of
WRITE_SYNC --- the commit record was being written by
sync_dirty_buffer(), and this of _all_ places needs to use WRITE_SYNC,
since it waits for the buffer to be written write after submitting the
I/O, and (2) using the anticipatory I/O scheduler instead of the cfq
I/O scheduler.

(1) brought things down from 2-3.5 seconds on my system to 1-2
seconds, and (2) brought things down to what you see above.  I think
what is going on with the cfq scheduler is that it's using time slices
to make sure sync and async operations never completely starve each
other out, and in this case we need to tune the I/O scheduling
parameters so that for this workload, the synchronous operations don't
end up getting impeded by the asynchronous writes caused by background
"write big file and sync" task.

In any case, Jens may want to look at this test case (ext3
data=writeback, 'while : ; do time sh -c "dd if=/dev/zero of=bigfile
bs=8M count=256 ; sync"; done', and fsync-tester) as a good way to see
how cfq might be improved.

On another thread, it's been commented that there are still workloads
for which people are quietly switching from CFQ to AS, and this is
bad, because it causes us not to try to figure out why our default I/O
scheduler still as these 1% of cases where people need to use another
scheduler.  Well, here's one such case which is relatively easy to
reproduce.

> Are we perhaps ending up doing those regular 'bigfile' writes as 
> WRITE_SYNC, just because of the global "sync()" call? That's probably a 
> bad idea. A "sync" is about pure throughput. It's not about latency like 
> "fsync()" is.

Well, at least on my system where I did this round of testing (4gig
X61s with a 5400 RPM laptop drive), most of the time we weren't
writing the bigfile writes because of the sync, but because dd and
pdflush processes was trying to flush out the dirty pages from the big
write operation.  At the moment where "dd" completes and the "sync"
command is executed, the fsync latency jumped up to about 4-5 seconds
before this last round of changes.  After adding the attached patch
and switching to the AS I/O scheduler, at the moment of the sync the
fsync latency was just over a second (1.1 to 1.2 seconds).  The rest
of the time we are averaging between a 1/4 and a 1/3 of a second, with
rare fsync latency spikes up to about 3/4 of a second, as show at the
beginning of this message.

(Maybe on a system with a lot more memory, the dirty pages don't start
getting flushed to disk until the sync command, but that's not what
I'm seeing on my 4 gig laptop.)

In answer to your question, "sync" does the writes in two passes;
first it pushes out writes with wbc.sync_mode set to WB_SYNC_NONE, and
then it calls the page writeback routines a second time with
WB_SYNC_ALL.  So most of the writes should go out with WRITE, except
that the page writeback routines aren't as aggressive about pushing
out _all_ pages in WB_SYNC_NONE, so I believe some of the pages would
still be written on the WB_SYNC_ALL, and thus would go out using
WRITE_SYNc.  This is based on 2-3 month old memory of how things
worked in the page-writeback routines, which is the last time I traced
the very deep call trees involved in this area.  I'd have to run a
blktrace experiment to see for sure how many of the writes were going
out as WRITE vs. WRITE_SYNC in the case of the 'sync' command.

In any case, I recommend you take the following attached patch, and
then try out ext3 data=writeback with anticipatory I/O scheduler.
Hopefully you'll be pleased with the results.

							- Ted

>From 6d293d2aa42d43c120f113bde55f7b0d6f3f35ae Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Sat, 4 Apr 2009 09:17:38 -0400
Subject: [PATCH] sync_dirty_buffer: Use WRITE_SYNC instead of WRITE

The sync_dirty_buffer() function submits a buffer for write, and then
synchronously waits for it.  It clearly should use WRITE_SYNC instead
of WRITE.  This significantly reduces ext3's fsync() latency when
there is a huge background task writing data asyncronously in the
background, since ext3 uses sync_dirty_buffer() to write the commit
block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/buffer.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2ed4b68..78ed086 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3038,7 +3038,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
 	if (test_clear_buffer_dirty(bh)) {
 		get_bh(bh);
 		bh->b_end_io = end_buffer_write_sync;
-		ret = submit_bh(WRITE, bh);
+		ret = submit_bh(WRITE_SYNC, bh);
 		wait_on_buffer(bh);
 		if (buffer_eopnotsupp(bh)) {
 			clear_buffer_eopnotsupp(bh);
-- 
1.5.6.3


  reply	other threads:[~2009-04-04 13:57 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-03  7:01 [GIT PULL] Ext3 latency fixes Theodore Ts'o
2009-04-03  7:01 ` [PATCH 1/4] block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks Theodore Ts'o
2009-04-03  7:01   ` [PATCH 2/4] ext3: Use WRITE_SYNC for commits which are caused by fsync() Theodore Ts'o
2009-04-03  7:01     ` [PATCH 3/4] ext3: Add replace-on-truncate hueristics for data=writeback mode Theodore Ts'o
2009-04-03  7:01       ` [PATCH 4/4] ext3: Add replace-on-rename " Theodore Ts'o
2009-04-03 18:24 ` [GIT PULL] Ext3 latency fixes Linus Torvalds
2009-04-03 18:47   ` Jens Axboe
2009-04-03 19:13     ` Theodore Tso
2009-04-03 21:01     ` Chris Mason
2009-04-03 19:02   ` Linus Torvalds
2009-04-03 20:41     ` Linus Torvalds
2009-04-04 13:57       ` Theodore Tso [this message]
2009-04-04 15:16         ` Jens Axboe
2009-04-04 15:57           ` Linus Torvalds
2009-04-04 16:06             ` Linus Torvalds
2009-04-04 17:36               ` Jens Axboe
2009-04-04 17:34             ` Jens Axboe
2009-04-04 17:44               ` Linus Torvalds
2009-04-04 18:00                 ` Trenton D. Adams
2009-04-04 18:01                 ` Jens Axboe
2009-04-04 18:10                   ` Linus Torvalds
2009-04-04 23:22                   ` Theodore Tso
2009-04-04 23:33                     ` Arjan van de Ven
2009-04-05  0:10                       ` Theodore Tso
2009-04-05 15:05                         ` Arjan van de Ven
2009-04-05 17:01                         ` Linus Torvalds
2009-04-05 17:15                           ` Mark Lord
2009-04-05 20:57                             ` Jeff Garzik
2009-04-05 23:48                               ` Arjan van de Ven
2009-04-06  2:32                                 ` Mark Lord
2009-04-06  5:47                                 ` Jeff Garzik
2009-04-07 18:18                                   ` Linus Torvalds
2009-04-07 18:22                                     ` Linus Torvalds
2009-04-07 19:40                                     ` [PATCH libata: add SSD detection hueristic; move SSD setup to ata_dev_configure (was Re: [GIT PULL] Ext3 latency fixes) Jeff Garzik
2009-04-09 18:21                                       ` Tejun Heo
2009-04-18  3:02                                         ` George Spelvin
2009-04-06  8:13                             ` [GIT PULL] Ext3 latency fixes Jens Axboe
2009-04-05 18:56                           ` Arjan van de Ven
2009-04-05 19:34                             ` Linus Torvalds
2009-04-05 20:06                               ` Arjan van de Ven
2009-04-06  6:25                               ` Jens Axboe
2009-04-06  6:05                           ` Theodore Tso
2009-04-06  6:23                           ` Jens Axboe
2009-04-06  8:16                       ` Jens Axboe
2009-04-06 14:48                         ` Linus Torvalds
2009-04-06 15:09                           ` Jens Axboe
2009-04-06  6:15                     ` Jens Axboe
2009-04-04 20:18               ` Ingo Molnar
2009-04-06 21:50                 ` Lennart Sorensen
2009-04-07 13:31                   ` Mark Lord
2009-04-07 14:48                     ` Lennart Sorensen
2009-04-07 19:21                       ` Mark Lord
2009-04-07 19:57                         ` Lennart Sorensen
2009-04-04 20:56               ` Arjan van de Ven
2009-04-06  7:06                 ` Jens Axboe
2009-04-07 15:39             ` Indan Zupancic
2009-04-04 19:18           ` Theodore Tso
2009-04-06  8:12             ` Jens Axboe
2009-04-04 22:13         ` Linus Torvalds
2009-04-04 22:19           ` Linus Torvalds
2009-04-05  0:20           ` Theodore Tso
2009-04-03 19:54   ` Theodore Tso
2009-04-08 23:40 Theodore Ts'o
2009-04-09 15:49 ` Linus Torvalds
2009-04-09 16:23   ` Chris Mason
2009-04-09 17:49     ` Jan Kara
2009-04-09 18:10       ` Chris Mason
2009-04-09 19:04         ` Jan Kara
2009-04-09 17:36   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090404135719.GA9812@mit.edu \
    --to=tytso@mit.edu \
    --cc=jens.axboe@oracle.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).